

For sale and distribution in the People’s Republic of China exclusively
(except Hong Kong SAR, Macao SAR and Taiwan).










PEKING UNIVERSITY PRESS





Copyright contract registration no.: 01-2005-1100

Cataloging-in-Publication (CIP) data

Psychological Testing: History, Principles, and Applications / Gregory, R. J.
Peking University Press, 2005.3

ISBN 7-301-08049-2

Classification: B841.7
CIP record no. (2004) 099654


English reprint edition copyright © 2004 by PEARSON EDUCATION ASIA LIMITED and PEKING 
UNIVERSITY PRESS.
Original English language title from Proprietor’s edition of the Work. 

Original English language title: Psychological Testing: History, Principles, and Applications, 

Robert J. Gregory, Copyright © 2004 

ISBN: 0205354726 

All Rights Reserved. 


Published by arrangement with the original publisher, Pearson Education, Inc., publishing as Allyn & Bacon, Inc. 

This edition is authorized for sale and distribution in the People’s Republic of China exclusively
(except Hong Kong SAR, Macao SAR and Taiwan).


Author: Robert J. Gregory

Standard book number: ISBN 7-301-08049-2/C · 0295

Publisher: Peking University Press

Postal code: 100871

Website: http://cbs.pku.edu.cn    E-mail: pw@pup.pku.edu.cn

Telephone: 62752015, 62750672, 58874097, 58874098

Format: 850 × 1168 mm, 16mo; 46.125 printed sheets; 578 thousand characters
First edition March 2005; second printing January 2006

Price: 73.00 yuan









Acknowledgments


Kelly May

George M. Alliger
Linda J. Allred
Kay Bathurst (Fullerton)
Fred Brown
Michael L. Chase
Milton J. Dehn (La Crosse)
Timothy S. Hartshorne
Herbert W. Helm
Ted Jaeger
Richard Kimball
Haig J. Kojian
Phyllis M. Ladrigan
Terry G. Newell (Fresno)
Walter L. Porter
Linda Krug Porzelius, SUNY (Brockport)
Robert W. Read
Robert A. Reeves
James R. Sorensen
Billy Van Jones

Timothy Chaddock, John Laskowski, and Cassie Hornbeck

Mary, Sara, and Anne


Preface


Psychological testing began as a timid enterprise
in the scholarly laboratories of nineteenth-century
European psychologists. From this
inauspicious birth, the practice of testing prolifer- 
ated throughout the industrialized world at an ever 
accelerating pace. As the reader will discover 
within the pages of this book, psychological test- 
ing now impacts virtually every corner of modern 
life, from education to vocation to remediation. 


PURPOSE OF THE BOOK


The fourth edition of this book is based upon the 
same assumptions as earlier versions. Its ambitious 
purpose is to provide the reader with knowledge 
about the characteristics, objectives, and wide- 
ranging effects of the consequential enterprise, 
psychological testing. In pursuit of this goal, I 
have incorporated certain well-worn traditions but 
proceeded into some new directions as well. For 
example, in the category of customary traditions, 
the book embraces the usual topics of norms, stan- 
dardization, reliability, validity, and test construc- 
tion. Furthermore, in the standard manner, I have 
assembled and critiqued a diverse compendium of 
tests and measures in such traditional areas as in- 
tellectual, achievement, industrial-organizational, 
vocational, and personality testing. 


Special Features 


In addition to the traditional topics previously 
listed, I have emphasized certain issues, themes, 
and concepts that are, in my opinion, essential for 
an in-depth understanding of psychological testing.
For example, the book opens with a chapter on
the history of psychological testing. The place- 
ment of this chapter underscores my view that the 
history of psychological testing is of substantial 
relevance to present-day practices. Put simply, a 
mature comprehension of modern testing can be 
obtained only by delving into its heritage. Of 
course, students of psychology typically shun his- 
torical matters because these topics are often pre- 
sented in a dull, dry, and pedantic manner, devoid 
of relevance to the present. However, I hope the
skeptical reader will approach my history chapter 
with an open mind—I have worked hard to make 
it interesting and relevant. 

Psychological testing represents a contract be- 
tween two persons. One person—the examiner— 
usually occupies a position of power over the other 
person—the examinee. For this reason, the exam- 
iner needs to approach testing with utmost sensi- 
tivity to the needs and rights of the examinee. To 
emphasize this crucial point, I have devoted an 
early chapter to the subtleties of the testing pro- 
cess, including such issues as establishing rapport 
and watching for untoward environmental influ- 
ences upon test results. The last topic in the book 
also emphasizes the contractual nature of assess- 
ment by reviewing professional issues and ethical 
standards in testing. 

Another topic emphasized in this book is neuro- 
psychological assessment, a burgeoning subfield of 
clinical psychology that is now a well-established 
specialty in its own right. Neuropsychological as- 
sessment is definitely a growth area and now con- 
stitutes one of the major contemporary applications 
of psychological testing. I have devoted an entire 
chapter to this important subject. So that the reader
can better appreciate the scope and purpose of neuropsychological
assessment, I begin the chapter with
a succinct review of neurological principles before 
discussing specific instruments. Tangentially, this 
review introduces important concepts in neuro- 
psychological assessment such as the relationship 
between localized brain dysfunction and specific 
behavioral symptoms. Nonetheless, readers who 
need to skip the section on neurological under- 
pinnings of behavior may do so with minimal 
loss—the section on neuropsychological tests and 
procedures is comprehensible in its own right.

This is more than a book about tests and their
reliabilities and validities. I also explore numerous 
value-laden issues bearing on the wisdom of testing. 
Psychological tests are controversial precisely be- 
cause the consequences of testing can be harmful, 
certainly to individuals and perhaps to the entire so- 
cial fabric as well. I have not ducked the controver- 
sies surrounding the use of psychological tests. 
Separate topics explore genetic and environmental 
contributions to intelligence, origins of race differ- 
ences in IQ, test bias and extravalidity concerns, 
cheating on group achievement tests, courtroom tes- 
timony, and ethical issues in psychological testing. 


Note on Case Exhibits 


This edition continues the use of case histories and 
brief vignettes that feature testing concepts and il- 
lustrate the occasionally abusive application of 
psychological tests. These examples are “boxed” 
and referred to as Case Exhibits. Most are based on 
my personal experience rather than scholarly un- 
dertakings. All of these case histories are real. The 
episodes in question really happened—I know be- 
cause I have direct knowledge of the veracity of 
each anecdote. These points bear emphasis because 
the reader will likely find some of the vignettes to 
be utterly fantastical and almost beyond belief. Of 
course, to guarantee the privacy of persons and 
institutions, I have altered certain unessential de- 
tails while maintaining the basic thrust of the 
original events. 


CHANGES FROM THE 
THIRD EDITION 


A revised edition should strive to add the latest 
findings about specific tests and incorporate new 
thinking on old concepts. I have done both. For ex- 
ample, several tests have been revised since the 
last edition (e.g., Stanford-Binet: Fifth Edition, 
PPVT-III, DIAL-III, Conners’ Rating Scales, PIC- 
2, to name just a few), and I have described the 
newest editions and included the relevant research. 
Also, some psychometric concepts have gained 
new prominence, and I have included updated cov- 
erage of these. A case in point is item response the- 
ory (also known as latent trait theory), which is 
slowly becoming the standard model for test con- 
ceptualization and development. 

In sum, the improvements and enhancements 
in the current edition include the following: 


1. Item response theory is now included in Topic 
3B: Concepts of Reliability. 

2. Information-processing theories of intelligence 
have been added to Topic 5A: Theories and the 
Measurement of Intelligence. 

3. A new test, the Neonatal Behavioral Assess- 
ment Scale, is presented in Topic 5B: Assess- 
ment of Infant and Preschool Abilities. 

4. New research on the meaning (or lack thereof) 
of the WISC-III Freedom from Distractibility 
factor has been added to Topic 6A: Individual 
Tests of Intelligence. 

5. Coverage of the assessment of persons who are 
deaf or hard of hearing has been expanded in 
Topic 7A: Testing Special Populations; this sec- 
tion also includes upgraded coverage of the 
Americans with Disabilities Act. 

6. The Lexile framework, an important new de- 
velopment in the assessment of reading skills, 
is introduced in Topic 8B: Group Tests of 
Achievement. 

7. A completely new section on attitudes and their 
assessment has been added in Topic 12B: Atti- 
tudes and the Assessment of Moral and Spiri- 
tual Concepts. 


8. Ecological momentary assessment, an exciting 
new development in behavioral assessment, has 
been added to Topic 14B: Behavioral Assess- 
ment and Related Approaches. 

9. The coverage of actuarial versus clinical pre- 
diction has been expanded in Topic 15A: Com- 
puterized Assessment and the Future of Testing. 


Of course, minor but essential changes have been 
made throughout the entire book to capture the 
latest developments in testing. Some of these 
changes include updating ethical concepts to in- 
clude the 2003 revision of the Ethical Principles of 
Psychologists and Code of Conduct; adding new 
research on violence risk assessment; and expand- 
ing the coverage of virtual reality and video-based 
assessment approaches. 


OUTLINE OF THE BOOK


Topical Organization


To accommodate the widest possible audience, I 
have incorporated an outline that splits the gargan- 
tuan field of psychological testing—its history, 
principles, and applications—into 30 small, man- 
ageable, modular topics. I was intrigued to discover 
that the 30 topics generally fell into natural pair- 
ings. Thus, the reader will notice that the book is 
also organized as an ordered series of 15 chapters 
of two topics each. The chapter format helps iden- 
tify pairs of topics that are more or less contiguous 
and reduces the need for redundant preambles to 
each topic. 

The most fundamental and indivisible unit of 
the book is the topic. Each topic stands on its own. 
In each topic, the reader encounters a manageable 
number of concepts and reviews a modest number 
of tests. To the student, the advantage of topical or- 
ganization is that the individual topics are small 
enough to read at a single sitting. To the instructor, 
the advantage of topical organization is that sub- 
jects deemed of lesser importance can be easily ex- 
cised from the reading list. Naturally, I would 
prefer that every student read every topic, but I am
a realist too. Often, a foreshortened textbook is
necessary for practical reasons such as the length 
of the school term. In those instances, the instruc- 
tor will find it easy to fashion a subset of topics to 
meet the curricular needs of almost any course in 
psychological testing. 


Basic Outline 


The 15 chapters break down into six broad areas, 
as follows: 


History
  Chapter 1: The History of Psychological Testing
    Topic 1A: The Origins of Psychological Testing
    Topic 1B: Early Testing in the United States


Foundations
  Chapter 2: Tests and the Testing Process
    Topic 2A: The Nature and Uses of Psychological Tests
    Topic 2B: The Testing Process
  Chapter 3: Norms and Reliability
    Topic 3A: Norms and Test Standardization
    Topic 3B: Concepts of Reliability
  Chapter 4: Validity and Test Development
    Topic 4A: Basic Concepts of Validity
    Topic 4B: Test Construction


Intellectual and Ability Testing
  Chapter 5: Intelligence Testing I: Theories and Preschool Assessment
    Topic 5A: Theories and the Measurement of Intelligence
    Topic 5B: Assessment of Infant and Preschool Abilities
  Chapter 6: Intelligence Testing II: Individual and Group Tests
    Topic 6A: Individual Tests of Intelligence
    Topic 6B: Group Tests of Intelligence
  Chapter 7: Test Bias and Testing Special Populations
    Topic 7A: Testing Special Populations
    Topic 7B: Test Bias and Other Controversies
  Chapter 8: Group Tests of Aptitude and Achievement
    Topic 8A: Aptitude Tests and Factor Analysis
    Topic 8B: Group Tests of Achievement


Specialized Applications
  Chapter 9: Neuropsychological and Geriatric Assessment
    Topic 9A: A Primer of Neuropsychology
    Topic 9B: Neuropsychological and Geriatric Assessment
  Chapter 10: Special Settings for Psychological Assessment
    Topic 10A: School-Based Assessment
    Topic 10B: Forensic Applications of Assessment
  Chapter 11: Industrial and Organizational Assessment
    Topic 11A: Personnel Assessment and Selection
    Topic 11B: Appraisal of Work Performance
  Chapter 12: Attitudes, Interests, and Values Assessment
    Topic 12A: Interests and Values in Vocational Assessment
    Topic 12B: Attitudes and the Assessment of Moral and Spiritual Concepts


Personality Testing
  Chapter 13: Origins of Personality Testing
    Topic 13A: Theories and the Measurement of Personality
    Topic 13B: Projective Techniques
  Chapter 14: Structured Personality Assessment
    Topic 14A: Self-Report Inventories
    Topic 14B: Behavioral Assessment and Related Approaches


Computer-Aided Assessment and Professional Issues
  Chapter 15: Special Topics and Issues in Testing
    Topic 15A: Computerized Assessment and the Future of Testing
    Topic 15B: Ethical and Social Issues in Testing


The book also features an extensive glossary, ap- 
pendices for locating tests and publishers, and a 
table for converting percentile ranks to standard 
and standardized-score equivalents. In addition, an 
important feature is Appendix A, titled Major 
Landmarks in the History of Psychological Testing. 
To meet personal needs, readers and course in- 
structors will pick and choose from these topics as 
they please.


CHAPTER 1

The History of
Psychological Testing


Topic 1A

The Origins of Psychological Testing


The Importance of Testing 


Case Exhibit 1.1 The Consequences of Test Results 
Rudimentary Forms of Testing in China in 2200 B.C.
Psychiatric Antecedents of Psychological Testing 

The Brass Instruments Era of Testing 

Changing Conceptions of Mental Retardation in the 1800s 
Influence of Binet’s Early Research upon His Test 

Binet and Testing for Higher Mental Processes 

The Revised Scales and the Advent of IQ 


Summary 


The history of psychological testing is a fascinating
story and has abundant relevance to
present-day practices. After all, contemporary tests 
did not spring from a vacuum; they evolved slowly 
from a host of precursors introduced over the last 
one hundred years. Accordingly, Chapter 1 features 
a review of the historical roots of present-day psy- 
chological tests. In Topic 1A, The Origins of Psy- 
chological Testing, we focus largely on the efforts 
of European psychologists to measure intelligence 
during the late nineteenth century and pre-World 
War I era. These early intelligence tests and their 


successors often exerted powerful effects on the 
examinees who took them, so the first topic also 
incorporates a brief digression documenting the 
pervasive importance of psychological test results. 
Topic 1B, Early Testing in the United States, cata- 
logues the profusion of tests developed by Ameri- 
can psychologists in the first half of the twentieth 
century. 

Psychological testing in its modern form origi- 
nated little more than one hundred years ago in lab- 
oratory studies of sensory discrimination, motor 
skills, and reaction time. The British genius Francis
Galton (1822-1911) invented the first battery of
tests, a peculiar assortment of sensory and motor 
measures, which we review in the following. The 
American psychologist James McKeen Cattell 
(1860-1944) studied with Galton and then, in 1890, 
proclaimed the modern testing agenda in his classic 
paper entitled “Mental Tests and Measurements.” 
He was tentative and modest when describing the 
purposes and applications of his instruments: 


Psychology cannot attain the certainty and exact- 
ness of the physical sciences, unless it rests on a 
foundation of experiment and measurement. A step 
in this direction could be made by applying a series 
of mental tests and measurements to a large num- 
ber of individuals. The results would be of consid- 
erable scientific value in discovering the constancy 
of mental processes, their interdependence, and 
their variation under different circumstances. Indi- 
viduals, besides, would find their tests interesting, 
and, perhaps, useful in regard to training, mode of 
life or indication of disease. The scientific and 
practical value of such tests would be much in- 
creased should a uniform system be adopted, so 
that determinations made at different times and 
places could be compared and combined. (Cattell, 
1890) 


Cattell’s conjecture that “perhaps” tests would 
be useful in “training, mode of life or indication of 
disease” must certainly rank as one of the prophetic 
understatements of all time. Anyone reared in the 
Western world knows that psychological testing 
has emerged from its timid beginnings to become 
a big business and a cultural institution that per- 
meates modern society. To cite just one example, 
consider the number of standardized achievement 
and ability tests administered in the school systems 
of the United States. Although it is difficult to ob- 
tain exact data on the extent of such testing, an es- 
timate of 200 million per year is probably not 
extreme (Medina & Neill, 1990). Of course, the 
total number of tests administered yearly also includes
millions of personality tests and untold
numbers of the thousands of other kinds of tests 
now in existence (Conoley & Kramer, 1989, 1992; 
Mitchell, 1985; Sweetland & Keyser, 1987). There 
is no doubt that testing is pervasive. But does it 
make a difference?


THE IMPORTANCE OF TESTING


Tests are used in almost every nation on earth for 
counseling, selection, and placement. Testing oc- 
curs in settings as diverse as schools, civil service, 
industry, medical clinics, and counseling centers. 
Most persons have taken dozens of tests and 
thought nothing of it. Yet, by the time the typical in- 
dividual reaches retirement age, it is likely that psy- 
chological test results will help shape his or her 
destiny. The deflection of the life course by psy- 
chological test results might be subtle, such as 
when a prospective mathematician qualifies for an 
accelerated calculus course based on tenth-grade 
achievement scores. More commonly, psychologi- 
cal test results alter individual destiny in profound 
ways. Whether a person is admitted to one college 
and not another, offered one job but refused a sec- 
ond, diagnosed as depressed or not—all such de- 
terminations rest, at least in part, on the meaning of 
test results as interpreted by persons in authority. 
Put simply, psychological test results change lives. 
For this reason it is prudent—indeed, almost 
mandatory—that students of psychology learn 
about the contemporary uses and occasional abuses 
of testing. In Case Exhibit 1.1, the life-altering af- 
termath of psychological testing is illustrated by 
means of several true case history examples. 

The importance of testing is also evident from 
historical review. Students of psychology generally 
regard historical issues as dull, dry, and pedantic, 
and sometimes these prejudices are well deserved.
After all, many textbooks fail to explain the rele- 
vance of historical matters and provide only vague 
sketches of early developments in mental testing. 
As a result, students of psychology often conclude 
incorrectly that historical issues are boring and 
irrelevant. 

In reality, the history of psychological testing is 
a captivating story that has substantial relevance to 
present-day practices. Historical developments are 
pertinent to contemporary testing for the following 
reasons: 


1. A review of the origins of psychological testing
   helps explain current practices that might otherwise
   seem arbitrary or even peculiar. For example,
   why do many current intelligence tests incorporate
   a seemingly nonintellective capacity,
   namely, short-term memory for digits? The answer
   is, in part, historical inertia: intelligence
   tests have always included a measure of digit
   span.

2. The strengths and limitations of testing also stand
   out better when tests are viewed in historical context.
   The reader will discover, for example, that
   modern intelligence tests are exceptionally good
   at predicting school failure, precisely because
   this was the original and sole purpose of the first
   such instrument developed in Paris, France, at
   the turn of the twentieth century.

3. Finally, the history of psychological testing contains
   some sad and regrettable episodes that
   help remind us not to be overly zealous in our
   modern-day applications of testing. For example,
   based on the misguided and prejudicial
   application of intelligence test results, several
   prominent psychologists helped ensure passage
   of the Immigration Restriction Act of 1924.


In later chapters, we examine the principles of 
psychological testing, investigate applications in 
specific fields (e.g., personality, intelligence, 
neuropsychology), and reflect on the social and 
legal consequences of testing. However, the reader 
will find these topics more comprehensible when 
viewed in historical context. So, for now, we begin 
at the beginning by reviewing rudimentary forms 
of testing that existed over four thousand years ago 
in imperial China. 


RUDIMENTARY FORMS OF TESTING 
IN CHINA IN 2200 B.C.


Although the widespread use of psychological test- 
ing is largely a phenomenon of the twentieth cen- 
tury, historians note that rudimentary forms of 
testing date back to at least 2200 B.C. when the Chinese
emperor had his officials examined every third
year to determine their fitness for office (Bowman, 
1989; Chaffee, 1985; DuBois, 1970; Franke, 1963; 
Lai, 1970; Teng, 1942-43). Such testing was modi- 
fied and refined over the centuries until written 
exams were introduced in the Han dynasty (202 
B.C.-A.D. 200). Five topics were tested: civil law, 
military affairs, agriculture, revenue, and geography. 

The Chinese examination system took its final 
form about 1370 when proficiency in the Confucian 
classics was emphasized. In the preliminary exam- 
ination, candidates were required to spend a day 
and a night in a small isolated booth, composing es- 
says on assigned topics and writing a poem. The 1 
to 7 percent who passed moved up to the district 
examinations, which required three separate ses- 
sions of three days and three nights. 

The district examinations were obviously gruel- 
ing and rigorous, but this was not the final level. The 
1 to 10 percent who passed were allowed the privi- 
lege of going to Peking for the final round of exam- 
inations. Perhaps 3 percent of this final group passed 
and became mandarins, eligible for public office. 

Although the Chinese developed the external 
trappings of a comprehensive civil service examination
program, the similarities between their traditions
and current testing practices are, in the
main, superficial. Not only were their testing prac- 
tices unnecessarily grueling, the Chinese also failed 
to validate their selection procedures. Nonetheless, 
it does appear that the examination program incor- 
porated relevant selection criteria. For example, in 
the written exams beauty of penmanship was 
weighted very heavily. Given the highly stylistic 
features of Chinese written forms, good penman- 
ship was no doubt essential for clear, exact com- 
munication. Thus, penmanship was probably a 
relevant predictor of suitability for civil service em- 
ployment. In response to widespread discontent, 
the examination system was abolished by royal de- 
cree in 1906 (Franke, 1963). 


PSYCHIATRIC ANTECEDENTS 
OF PSYCHOLOGICAL TESTING 


Most historians trace the beginnings of psycholog- 
ical testing to the experimental investigation of in- 
dividual differences that flourished in Germany and 
Great Britain in the late 1800s. There is no doubt 
that early experimentalists such as Wilhelm Wundt, 
Francis Galton, and James McKeen Cattell laid the 
foundations for modern-day testing, and we will re- 
view their contributions in detail. But psychologi- 
cal testing owes as much to early psychiatry as it 
does to the laboratories of experimental psychol- 
ogy. In fact, the examination of the mentally ill 
around the middle of the nineteenth century re- 
sulted in the development of numerous early tests 
(Bondy, 1974). These early tests lacked
standardization and were consequently
relegated to oblivion. They were nonetheless influ- 
ential in determining the course of psychological 
testing, so it is important to mention a few typical 
developments from this era. 

In 1885, the German physician Hubert von 
Grashey developed the antecedent of the memory 
drum as a means of testing brain-injured patients. 
His subjects were shown words, symbols, or pic- 
tures through a slot in a sheet of paper that was 
moving slowly over the stimuli. Grashey found that 
many patients could recognize stimuli in their to- 
tality but could not identify them when shown 
through the moving slot. Shortly thereafter, the
German psychiatrist Conrad Rieger developed an 
excessively ambitious test battery for brain dam- 
age. His battery took over 100 hours to administer 
and soon fell out of favor. 

In summary, early psychiatry contributed to the 
mental test movement by showing that standard- 
ized procedures could help reveal the nature and 
extent of symptoms in the mentally ill and brain- 
injured patients. Most of the early tests developed 
by psychiatrists faded into oblivion, but a few pro- 
cedures were standardized and perpetuate them- 
selves in modern variations (Bondy, 1974). 

THE BRASS INSTRUMENTS
ERA OF TESTING

Experimental psychology flourished in the late
1800s in continental Europe and Great Britain. For 
the first time in history, psychologists departed 
from the wholly subjective and introspective meth- 
ods that had been so fruitlessly pursued in the pre- 
ceding centuries. Human abilities were instead 
tested in laboratories. Researchers used objective 
procedures that were capable of replication. Gone 
were the days when rival laboratories would have 
raging arguments about “imageless thought,” one 
group saying it existed, another group saying that 
such a mental event was impossible. 

Even though the new emphasis on objective 
methods and measurable quantities was a vast im- 
provement over the largely sterile mentalism that 
preceded it, the new experimental psychology was 
itself a dead end, at least as far as psychological 
testing was concerned. The problem was that the 
early experimental psychologists mistook simple 
sensory processes for intelligence. They used as- 
sorted brass instruments to measure sensory thresh- 
olds and reaction times, thinking that such abilities 
were at the heart of intelligence. Hence, this period 
is sometimes referred to as the Brass Instruments 
era of psychological testing. 

In spite of the false start made by early experi- 
mentalists, at least they provided psychology with 
an appropriate methodology. Such pioneers as 
Wundt, Galton, Cattell, and Clark Wissler showed 
that it was possible to expose the mind to scientific
scrutiny and measurement. This was a fateful
change in the axiomatic assumptions of psychol- 
ogy, a change that has stayed with us to the current 
day. 

Most sources credit Wilhelm Wundt (1832-1920)
with founding the first psychological laboratory
in 1879 in Leipzig, Germany. It is less well
recognized that he was measuring mental processes 
years before, at least as early as 1862, when he ex- 
perimented with his thought meter (Diamond, 
1980). This device was a calibrated pendulum with 
needles sticking off from each side. The pendulum 
would swing back and forth, striking bells with the 
needles. The observer’s task was to take note of the 
position of the pendulum when the bells sounded. 
Of course, Wundt could adjust the needles before- 
hand and thereby know the precise position of the 
pendulum when each bell was struck. Wundt 
thought that the difference between the observed 
pendulum position and the actual position would 
provide a means of determining the swiftness of 
thought of the observer. 

Wundt’s analysis was relevant to a longstanding 
problem in astronomy. The problem was that two 
or more astronomers simultaneously using the 
same telescope (with multiple eyepieces) would re- 
port different crossing times as the stars moved 
across a grid line on the telescope. Even in Wundt’s 
time, it was a well-known event in the history of 
science that Kinnebrook, an assistant at the Royal 
Observatory in England, had been dismissed in
1796 because his stellar crossing times were nearly 
a full second too slow (Boring, 1950). Wundt’s 
analysis offered another explanation that did not as- 
sume incompetence on the part of anyone. Put sim- 
ply, Wundt believed that the speed of thought might 
differ from one person to the next: 


For each person there must be a certain speed of
thinking, which he can never exceed with his given 
mental constitution. But just as one steam engine 
can go faster than another, so this speed of thought 
will probably not be the same in all persons. 
(Wundt, 1862, as translated in Rieber, 1980) 


This analysis of telescope reporting times seems 
simplistic by present-day standards and overlooks 
the possible contribution of such factors as attention, 
motivation, and self-correcting feedback from prior
trials. On the positive side, this was at least an em- 
pirical analysis that sought to explain individual 
differences instead of trying to explain them away. 
And that is the relevance to current practices in 
psychological testing. However crudely, Wundt 
measured mental processes and begrudgingly
acknowledged individual differences.¹


Galton and the First Battery 
of Mental Tests 


Sir Francis Galton (1822-1911) pioneered the new 
experimental psychology in nineteenth-century 
Great Britain. Galton was obsessed with measure- 
ment, and his intellectual career seems to have been 
dominated by a belief that virtually anything was 
measurable. His attempts to measure intellect by 
means of reaction time and sensory discrimination 
tasks are well known. Yet, to appreciate his wide- 
ranging interests, the reader should be apprised that 
Galton also devised techniques for measuring 
beauty, personality, the boringness of lectures, and 
the efficacy of prayer, to name but a few of the
endeavors that his biographer has catalogued in
elaborate detail (Pearson, 1914, 1924, 1930ab).

Galton was a genius who was more interested 
in the problems of human evolution than in psy- 
chology per se (Boring, 1950). His two most influ- 
ential works were Hereditary Genius (1869), an 
empirical analysis purporting to prove that genetic 
factors were overwhelmingly important for the at- 
tainment of eminence, and Inquiries into Human 
Faculty and Its Development (1883), a disparate se- 
ries of essays that emphasized individual differ- 
ences in mental faculties. 

Boring (1950) regards Inquiries as the begin- 
ning of the mental test movement and the advent of 
the scientific psychology of individual differences. 
The book is a curious mixture of empirical research 
and speculative essays on topics as diverse as “just 
perceptible differences” in lifted weight and di- 
minished fertility among inbred animals. There is, 


1. This emphasis upon individual differences was rare for 
Wundt. He is more renowned for proposing common laws of 
thought for the average adult mind. 


nonetheless, a common theme uniting these diverse 
essays; Galton demonstrates time and again that in- 
dividual differences not only exist but are objec- 
tively measurable. 

Galton borrowed the time-consuming psy- 
chophysical procedures practiced by Wundt and 
others on the European continent and adapted 
them to a series of simple and quick sensorimotor 
measures. Thus, he continued the tradition of brass 
instruments mental testing but with an impor- 
tant difference: his procedures were much more 
amenable to the timely collection of data from hun- 
dreds if not thousands of subjects. Because of his 
efforts in devising practicable measures of individ- 
ual differences, historians of psychological testing 
usually regard Galton as the father of mental test- 
ing (Goodenough, 1949; Boring, 1950). 

To further his study of individual differences, 
Galton set up a psychometric laboratory in London 
at the International Health Exhibition in 1884. It 
was later transferred to the London Museum, where 
it was maintained for six years. Various anthropo- 
metric and psychometric measures were arranged 
on a long table at one side of a narrow room. Sub- 
jects were admitted at one end for threepence and 
given successive tests as they moved down the table. 
At least 17,000 individuals were tested during the 
1880s and 1890s. About 7,500 of the individual data 
records have survived to the present day (Johnson 
et al., 1985). 

The tests and measures involved both the phys- 
ical and behavioral domains. Physical characteris- 
tics assessed were height, weight, head length, head 
breadth, arm span, length of middle finger, and 
length of lower arm, among others. The behavioral 
tests included strength of hand squeeze determined 
by dynamometer, vital capacity of the lungs mea- 
sured by spirometer, visual acuity, highest audible 
tone, speed of blow, and reaction time (RT) to both 
visual and auditory stimuli. 

Ultimately, Galton’s simplistic attempts to 
gauge intellect with measures of reaction time and 
sensory discrimination proved fruitless. Nonethe- 
less, he did provide a tremendous impetus to the 
testing movement by demonstrating that objective 
tests could be devised and that meaningful scores 
could be obtained through standardized procedures.


Cattell Imports Brass Instruments
to the United States 


James McKeen Cattell (1860-1944) studied the 
new experimental psychology with both Wundt and 
Galton before settling at Columbia University 
where, for twenty-six years, he was the undisputed 
dean of American psychology. With Wundt, he did 
a series of painstakingly elaborate RT studies 
(1880-1882), measuring with great precision the 
fractions of a second presumably required for dif- 
ferent mental reactions. He also noted, almost in 
passing, that he and another colleague had small 
but consistent differences in RT. Cattell proposed to 
Wundt that such individual differences ought to be 
studied systematically. Although Wundt acknowl- 
edged individual differences, he was philosophi- 
cally more inclined to study general features of the 
mind, and he offered no support for Cattell’s pro- 
posal (Fancher, 1985). 

But Cattell received enthusiastic support for his 
study of individual differences from Galton, who 
had just opened his psychometric laboratory in Lon- 
don. After corresponding with Galton for a few 
years, Cattell arranged for a two-year fellowship at 
Cambridge so that he could continue the study of in- 
dividual differences. Cattell opened his own research 
laboratory and developed a series of tests that were 
mainly extensions and additions to Galton’s battery. 

Cattell (1890) invented the term mental test in his 
famous paper entitled “Mental Tests and Measure- 
ments.” This paper described his research program, 
detailing ten mental tests he proposed for use with the 
general public. These tests were clearly a reworking 
and embellishment of the Galtonian tradition: 


Strength of hand squeeze as measured by
dynamometer

Rate of hand movement through a distance of
50 centimeters

Two-point threshold for touch: minimum distance
at which two points are still perceived
as separate

Degree of pressure needed to cause pain: rubber
tip pressed against the forehead

Weight differentiation: discern the relative
weights of identical-looking boxes varying
by one gram from 100 to 110 grams

Reaction time for sound: using a device similar
to Galton’s

Time for naming colors

Bisection of a 50-centimeter line

Judgment of 10 seconds of time

Number of letters repeated on one hearing


Strength of hand squeeze seems a curious addi- 
tion to a battery of mental tests, a point that Cattell 
(1890) addressed directly in his paper. He was of the
opinion that it was impossible to separate bodily en- 
ergy from mental energy. Thus, in Cattell’s view, an 
ostensibly physiological measure such as dyna- 
mometer pressure was an index of one’s mental 
power as well. Clearly, the physiological and sen- 
sory bias of the entire test battery reflects its 
strongly Galtonian heritage (Fancher, 1985). 

In 1891, Cattell accepted a position at Colum- 
bia University, at that time the largest university in 
the United States. His subsequent influence on 
American psychology was far in excess of his in- 
dividual scientific output and was expressed in 
large part through his numerous and influential stu- 
dents (Boring, 1950). Among his many famous 
doctoral students and the years of their degrees
were E. L. Thorndike (1898), who made monumental
contributions to learning theory and educational
psychology; R. S. Woodworth (1899), who
was to author the very popular and influential
Experimental Psychology (1938); and E. K. Strong
(1911), whose Vocational Interest Blank, since
revised, is still in wide use. But among Cattell’s
students, it was probably Clark Wissler (1901) who
had the greatest influence on the early history of 
psychological testing. 

Wissler obtained both mental test scores and 
academic grades from more than 300 students at 
Columbia University and Barnard College. His goal 
was to demonstrate that the test results could pre- 
dict academic performance. With our early twenty- 
first-century perspective on research and testing, it 
seems amazing that the early experimentalists 
waited so long to do such basic validational re- 
search. Wissler’s (1901) results showed virtually no 
tendency for the mental test scores to correlate with 
academic achievement. For example, class standing 
correlated .16 with memory for number lists, -.08
with dynamometer strength, .02 with color naming,
and -.02 with reaction time. The highest correlation
(.16) was statistically significant because of the
large sample size. However, so humble a correlation
carries with it very little predictive utility.²
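
The gap between statistical significance and predictive utility can be checked with a little arithmetic. The Python sketch below squares each correlation reported above to get the proportion of variance in class standing that a test accounts for, and computes the usual t statistic for a correlation. The sample size of 300 is an assumption (the text says only "more than 300"), and the function names are illustrative rather than taken from any source.

```python
import math

# Correlations with class standing reported by Wissler (1901), as quoted above.
wissler_r = {
    "memory for number lists": 0.16,
    "dynamometer strength": -0.08,
    "color naming": 0.02,
    "reaction time": -0.02,
}

N_ASSUMED = 300  # assumption: the text says only "more than 300 students"


def variance_explained(r):
    """Proportion of variance in one variable accounted for by the other."""
    return r ** 2


def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


for name, r in wissler_r.items():
    print(f"{name}: r = {r:+.2f}, "
          f"variance explained = {variance_explained(r):.1%}, "
          f"t = {t_statistic(r, N_ASSUMED):.2f}")
```

Under these assumptions the largest correlation, .16, clears the conventional two-tailed cutoff of roughly 1.97 for a sample of this size, yet it accounts for under 3 percent of the variance in class standing.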

Also damaging to the brass instruments testing
movement were the very modest correlations
between the mental tests themselves. For example,
color naming and hand movement speed correlated 
only .19, while RT and color naming correlated 
-.15. Several physical measures such as head size 
(a holdover measure from the Galton era) were, not 
surprisingly, also uncorrelated with the various sen- 
sory and RT measures. 

With the publication of Wissler’s (1901) 
discouraging results, experimental psychologists 
largely abandoned the use of RT and sensory dis- 
crimination as measures of intelligence. From one 
standpoint, this turning away from the brass instru- 
ments approach was a desirable development in the 
history of psychological testing. The way was 
thereby paved for immediate acceptance of Alfred 
Binet’s more sensible and useful measures of 
higher mental processes. 

But in other respects, the abandonment of RT 
and sensory measures was premature and unfortu- 
nate. After all, by contemporary standards Wissler’s 
research methods revealed an extraordinary psy- 
chometric naivete. By using only bright college 
students as subjects, Wissler had inadvertently in- 
troduced an extreme restriction of range, which 
would invariably reduce the size of his correlations. 
If a more heterogeneous sample of subjects had 
been used, the correlations would have been sub- 
stantially larger. In addition, certain measures such 
as RT were inherently unreliable because of the 
small number of trials per subject. Such unreliabil- 
ity in a measure also places a severe restriction on 
the upper bounds of correlation coefficients. 
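
Both psychometric problems named above, restriction of range and unreliability, can be illustrated with a small simulation. The Python sketch below is a hedged illustration, not a reanalysis of Wissler's data: the population correlation of .50, the top-10-percent selection rule, and the reliabilities of .30 and .80 are all hypothetical values chosen for the example. The ceiling formula is the classical test theory result that an observed correlation cannot exceed the square root of the product of the two reliabilities.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pairs(n, true_r, rng):
    """Draw n (predictor, criterion) pairs whose population correlation is true_r."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        y = true_r * x + math.sqrt(1.0 - true_r ** 2) * noise
        pairs.append((x, y))
    return pairs


def max_observable_r(reliability_x, reliability_y):
    """Ceiling on an observed correlation given the reliabilities of both measures."""
    return math.sqrt(reliability_x * reliability_y)


rng = random.Random(1901)                       # fixed seed so the run is repeatable
population = simulate_pairs(20_000, 0.50, rng)  # hypothetical true correlation of .50

# Restriction of range: keep only the top 10 percent on the first variable.
population.sort(key=lambda pair: pair[0])
selected = population[-2_000:]

r_full = pearson_r(*zip(*population))
r_restricted = pearson_r(*zip(*selected))
print(f"full-range r = {r_full:.2f}, restricted-range r = {r_restricted:.2f}")

# Unreliability: hypothetical reliabilities of .30 (RT) and .80 (grades).
print(f"ceiling on observed r = {max_observable_r(0.30, 0.80):.2f}")
```

In the selected subsample the correlation falls well below its full-range value even though the underlying relationship is unchanged, which is the pattern the text attributes to Wissler's use of college students only.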


2. We discuss the correlation coefficient in more detail in
Topic 3B, Concepts of Reliability. By way of quick preview,
correlations can range from -1.0 to +1.0. Values near zero indicate
a weak, negligible linear relationship between the two variables.
For example, correlations between -.20 and +.20 are generally
of minimal value for purposes of individual prediction. Note
also that negative correlations indicate an inverse relationship.


If Wissler’s (1901) negative findings had been
more skeptically scrutinized, a full 70 years might
not have passed before RT was resurrected as a
potentially useful intellectual measure. Correlations
of -.40 between complex forms of RT and intelligence
are not at all uncommon (Jensen, 1982).³

But that is getting ahead of the story. The more 
common reaction among psychologists in the early 
1900s was to begrudgingly conclude that Galton 
had been wrong in attempting to infer complex 
abilities from simple ones. Goodenough (1949) has 
likened Galton’s approach to “inferring the nature 
of genius from the nature of stupidity or the quali- 
ties of water from those of the hydrogen and 
oxygen of which it is composed.” The academic 
psychologists apparently agreed with her, and 
American attempts to develop intelligence tests vir- 
tually ceased at the turn of the twentieth century. 
For his own part, Wissler was apparently so dis- 
couraged by his results that he immediately 
switched to anthropology, where he became a 
strong environmentalist in explaining differences 
between ethnic groups. 

The void created by the abandonment of the 
Galtonian tradition did not last for long. In Europe, 
Alfred Binet was on the verge of a major break- 
through in intelligence testing. Binet introduced his 
scale of intelligence in 1905, and shortly thereafter 
H. H. Goddard imported it to the United States, 
where it was applied in a manner that Gould (1981) 
has described as “the dismantling of Binet’s inten- 
tions in America.” Whether early twentieth-century 
American psychologists subverted Binet’s inten- 
tions is an important question that we review in the 
next topic. First, we examine the social changes in 
nineteenth-century Europe that created the neces- 
sity for practical intelligence tests. 





3. The correlations are negative because low scores on RT are
associated with high scores on intelligence tests.


Many great inventions have been developed in response
to the practical needs created by changes in
societal values. Such is the case with intelligence
tests. To be specific, the first such tests were devel- 
oped by Binet in the early 1900s to help identify 
children in the Paris school system who were un- 
likely to profit from ordinary instruction. Prior to 
this time, there was little interest in the educational 
needs of children with mental retardation. A new 
humanism toward those with mental retardation thus 
created the practical problem—identifying those 
with special needs—that Binet’s tests were to solve. 

The Western world of the late 1800s was just 
emerging from centuries of indifference and hos- 
tility toward the psychiatrically and mentally im- 
paired. Medical practitioners were just beginning 
to acknowledge a distinction between individuals 
with emotional disabilities and those with mental retardation.
For centuries, all such social outcasts were given 
similar treatment. In the Middle Ages, they were 
occasionally “diagnosed” as witches and put to 
death by burning. Later on, they were alternately 
ignored, persecuted, or tortured. In his comprehen- 
sive history of psychotherapy and psychoanalysis, 
Bromberg (1959) has an especially graphic chapter 
on the various forms of maltreatment toward those 
with mental and emotional disabilities, from which 
only one example will be provided here. In 1698, a 
prominent physician wrote a gruesome book, Fla- 
gellum Salutis, in which beatings were advocated 
as treatment “in melancholia; in frenzy; in paraly- 
sis; in epilepsy; in facial expression of feeble- 
minded” (Bromberg, 1959). 

By the early 1800s, saner minds began to prevail. 
Medical practitioners realized that some of those with 
psychiatric impairment had reversible illnesses that 
did not necessarily imply diminished intellect, 
whereas other exceptional persons, those with men- 
tal retardation, showed a greater developmental con- 
tinuity and invariably had impaired intellect. In 
addition, a newfound humanism began to influence 
social practices toward individuals with psychologi- 
cal and mental disabilities. With this humanism there 
arose a greater interest in the diagnosis and remedia- 
tion of mental retardation. At the forefront of these 
developments were two French physicians, J. E. D. 
Esquirol and O. E. Seguin, each of whom revolutionized
thinking about those with mental retardation,
thereby helping to create the necessity for
Binet’s tests.


Esquirol and Diagnosis in Mental Retardation 


Around the beginning of the nineteenth century, 
many physicians had begun to perceive the differ- 
ence between mental retardation (then called id- 
iocy) and mental illness (often referred to as 
dementia). J. E. D. Esquirol (1772-1840) was the 
first to formalize the difference in writing. His 
diagnostic breakthrough was noting that mental re- 
tardation was a lifelong developmental phenome- 
non whereas mental illness usually had a more 
abrupt onset in adulthood. He thought that mental 
retardation was incurable, whereas mental illness 
might show improvement (Esquirol, 1845/1838). 

Esquirol placed great emphasis upon language 
skills in the diagnosis of mental retardation. This may 
offer a partial explanation as to why Binet’s later tests 
and the modern-day descendants of them are so
heavily loaded on linguistic abilities. After all, the 
original use of the Binet scales was, in the main, to 
identify children with mental retardation who would 
not likely profit from ordinary schooling. 

Esquirol also proposed the first classification 
system in mental retardation and it should be no 
surprise that language skills were the main diag- 
nostic criteria. He recognized three levels of men- 
tal retardation: (1) those using short phrases, 
(2) those using only monosyllables, and (3) those 
with cries only, no speech. Apparently, Esquirol did 
not recognize what we would now call mild mental 
retardation, instead providing criteria for the equiv- 
alents of the modern-day classifications of moder- 
ate, severe, and profound mental retardation. 


Seguin and Education of 
Individuals with Mental Retardation 


Perhaps more than any other pioneer in the field of 
mental retardation, O. Edouard Seguin (1812- 
1880) helped establish a new humanism toward 
those with mental retardation in the late 1800s. He 
had been a student of Esquirol and had also studied 
with J. M. G. Itard (1774-1838), who is well known
for his five-year attempt to train the Wild Boy of
Aveyron, a feral child who had lived in the woods 
for his first 11 or 12 years (Itard, 1932/1801). 

Seguin borrowed from techniques used by Itard 
and devoted his life to developing educational pro- 
grams for persons with mental retardation. As early 
as 1838, he had established an experimental class 
for such individuals. His treatment efforts earned 
him international acclaim and he eventually came 
to the United States to continue his work. In 1866, 
he published Idiocy, and Its Treatment by the Physiological
Method, the first major textbook on the
treatment of mental retardation. This book advo- 
cated a surprisingly modern approach to education 
of individuals with mental retardation and even 
touched on what would now be called behavior 
modification. 

Such was the social and historical background 
that allowed intelligence tests to flourish. We turn 
now to the invention of the modern-day intelligence 
test by Alfred Binet. We begin with a discussion of 
the early influences that shaped his famous test. 


INFLUENCE OF BINET’S EARLY
RESEARCH UPON HIS TEST


As most every student of psychology knows, Al- 
fred Binet (1857-1911) invented the first modern 
intelligence test in 1905. What is less well known, 
but equally important for those who seek an under- 
standing of his contributions to modern psychol- 
ogy, is that Binet was a prolific researcher and 
author long before he turned his attentions to intel- 
ligence testing. The character of his early research 
had a material bearing on the subsequent form of 
his well-known intelligence test. For those who 
seek a full understanding of his pathbreaking in- 
fluence, brief mention of Binet’s early career is 
mandatory. For more details the reader can consult 
DuBois (1970), Fancher (1985), Goodenough 
(1949), Gould (1981), and Wolf (1973). 

Binet began his career in medicine but was
forced to drop out because of a complete emotional 
breakdown. He switched to psychology, where he 
studied the two-point threshold and dabbled in 
the associationist psychology of John Stuart Mill 
(1806-1873). Later, he selected an apprenticeship 
with the neurologist J. M. Charcot (1825-1893) at
the famous Salpetriere Hospital. Thus, for a brief 
time Binet’s professional path paralleled that of 
Sigmund Freud, who also studied hysteria under 
Charcot. At the Salpetriere Hospital, Binet co- 
authored (with C. Fere) four studies supposedly 
demonstrating that reversing the polarity of a mag- 
net could induce complete mood changes (e.g., 
from happy to sad) or transfer of hysterical paraly- 
sis (e.g., from left to right side) in a single hypno- 
tized subject. In response to public criticism from 
other psychologists, Binet later published a recan- 
tation of his findings. This was a painful episode 
for Binet, and it sent his career into a temporary de- 
tour. Nonetheless, he learned two things through 
his embarrassment. First, he never again used 
sloppy experimental procedures that allowed for 
unintentional suggestion to influence his results. 
Second, he became skeptical of the zeitgeist (spirit 
of the times) in experimental psychology. Both of 
these lessons were applied when he later developed 
his intelligence scales. 

In 1891, Binet went to work at the Sorbonne as 
an unpaid assistant and began a series of studies 
and publications that were to define his new “in- 
dividual psychology” and ultimately to culminate 
in his intelligence tests. Binet was an ardent exper- 
imentalist, often using his two daughters to try out 
existing and new tests of intelligence. Early on, he 
flirted with a Cattellian approach to intelligence 
testing, using the standard measures of reaction 
time and sensory acuity on his two daughters. The 
results were annoyingly inconsistent and difficult 
to interpret. As might be expected, he found that the 
reaction times of his children were, on average, 
much slower than for adults. But on some trials his 
daughters’ performance approached or exceeded 
adult levels. From these findings, Binet concluded 
that attention was a key component of intelligence, 
which was itself a very multifaceted entity. Fur- 
thermore, he became increasingly disenchanted 
with the brass instruments approach to measuring 
intelligence, which probably explains his subse- 
quent use of measures of higher mental processes. 

In addition, Binet’s sensory-perceptual experi- 
ments with his children greatly influenced his 
views on proper testing procedures:


The experimenter is obliged, to a point, to adjust
his method to the subject he is addressing. There 
are certain rules to follow when one experiments 
on a child, just as there are certain rules for adults,
for hysterics, and for the insane. These rules are not 
written down anywhere; each one learns them for 
himself and is repaid in great measure. By making 
an error and later accounting for the cause, one 
learns not to make the mistake a second time. In re- 
gard to children, it is necessary to be suspicious of 
two principal causes of error: suggestion and fail- 
ure of attention. This is not the time to speak on the 
first point. As for the second, failure of attention, it 
is so important that it is always necessary to sus- 
pect it when one obtains a negative result. One 
must then suspend the experiments and take them 
up at a more favorable moment, restarting them 10 
times, 20 times, with great patience. Children, in 
fact, are often little disposed to pay attention to ex- 
periments which are not entertaining, and it is use- 
less to hope that one can make them more attentive 
by threatening them with punishment. By particu- 
lar tricks, however, one can sometimes give the ex- 
periment a certain appeal. (Binet, 1895, quoted in 
Pollack, 1971) 


It is interesting to contrast modern-day testing 
practices—which go so far as to specify the exact 
wording the examiner should use—with Binet’s ad- 
vice to exercise nearly endless patience and use en- 
tertaining tricks when testing children. 

BINET AND TESTING FOR HIGHER
MENTAL PROCESSES

In 1896, Binet and his Sorbonne assistant, Victor
Henri, published a pivotal review of German and 
American work on individual differences. In this 
historically important paper, they argued that intel- 
ligence could be better measured by means of the 
higher psychological processes rather than the ele- 
mentary sensory processes such as reaction time. 
After several false starts, Binet and Simon eventu- 
ally settled on the straightforward format of their 
1905 scales, discussed subsequently. 

The character of the 1905 scale owed much to 
a prior test developed by Dr. Blin (1902) and his
pupil, M. Damaye. They had attempted to improve
the diagnosis of mental retardation by using a battery
of assessments in 20 areas such as spoken language;
knowledge of parts of the body; obedience
to simple commands; naming common objects; and 
ability to read, write, and do simple arithmetic. 
Binet criticized the scale for being too subjective, 
for having items reflecting formal education, and 
for using a yes or no format on many questions 
(DuBois, 1970). But he was much impressed with 
the idea of using a battery of tests, a feature which 
he adopted in his 1905 scales. 

In 1904, the Minister of Public Instruction in 
Paris appointed a commission to decide upon the 
educational measures that should be undertaken 
with those children who could not profit from reg- 
ular instruction. The commission concluded that 
medical and educational examinations should be 
used to identify those children who could not learn 
by the ordinary methods. Furthermore, it was de- 
termined that these children should be removed 
from their regular classes and given special in- 
struction suitable to their more limited intellectual 
prowess. This was the beginning of the special ed- 
ucation classroom. 

It was evident that a means of selecting children 
for such special placement was needed, and Binet 
and his colleague Simon were called upon to de- 
velop a practical tool for just this purpose. Thus 
arose the first formal scale for assessing the intelli- 
gence of children. 

Goodenough (1949) has outlined the four ways 
in which the 1905 scale differed from those which 
had been previously constructed. 


1. It made no pretense of measuring precisely any 
single faculty. Rather, it was aimed at assessing 
the child’s general mental development with a 
heterogeneous group of tasks. Thus, the aim was 
not measurement, but classification. 

2. It was a brief and practical test. The test took less 
than an hour to administer and required little in 
the way of equipment. 

3. It measured directly what Binet and Simon re- 
garded as the essential factor of intelligence— 
practical judgment—rather than wasting time 
with lower-level abilities involving sensory, 
motor, and perceptual elements. They took a 
pragmatic view of intelligence:

There is in intelligence, it seems to us, a fundamental
agency the lack or alteration of which has the great- 
est importance for practical life; that is judgement, 
otherwise known as good sense, practical sense, ini- 
tiative, or the faculty of adapting oneself. To judge 
well, to understand well, to reason well—these are 
the essential wellsprings of intelligence. (Binet and 
Simon, 1905; as translated in Fancher, 1985) 


4. The items were arranged by approximate level of
difficulty instead of content. A rough standardization
had been done with 50 normal children ranging in age
from three to 11 years and several subnormal and
retarded children as well.


The 30 tests on the 1905 scale ranged from utterly
simple sensory tests to quite complex verbal
abstractions. Thus, the scale was appropriate for
assessing the entire gamut of intelligence, from
severe mental retardation to high levels of giftedness.
The entire scale is outlined in Table 1.1.


TABLE 1.1 The 1905 Binet-Simon Scale


1. Follows a moving object with the eyes.
2. Grasps a small object which is touched.
3. Grasps a small object which is seen.
4. Recognizes the difference between a square of chocolate and a square of wood.
5. Finds and eats a square of chocolate wrapped in paper.
6. Executes simple commands and imitates simple gestures.
7. Points to familiar named objects, e.g., “Show me the cup.”
8. Points to objects represented in pictures, e.g., “Put your finger on the window.”


9. Names objects in pictures, e.g., “What is this?” [examiner points to a picture of a sign]. 


10. Compares two lines of markedly unequal length. 
11. Repeats three spoken digits. 

12. Compares two weights. 

13. Shows susceptibility to suggestion. 

14. Defines common words by function. 

15. Repeats a sentence of 15 words. 


16. Tells how two common objects are different, e.g., “paper and cardboard.” 

17. Names from memory as many as possible of 13 objects displayed on a board for 30 seconds. [This test was later 
dropped because it permitted too many possibilities for distraction.] 

18. Reproduces from memory two designs shown for 10 seconds. 

19. Repeats a longer series of digits than in item 11 to test immediate memory. 

20. Tells how two common objects are alike, e.g., “butterfly and flea.” 


21. Compares two lines of slightly unequal length. 
22. Compares five blocks to put them in order of weight. 


23. Indicates which of the previous five weights the examiner has removed. 


24. Produces rhymes, e.g., “What rhymes with ‘school’?” 


25. A word completion test based on those proposed by Ebbinghaus. 
26. Puts three nouns, e.g., “Paris, river, fortune” (or three verbs) in a sentence. 
27. Responds to 25 abstract (comprehension) questions, e.g., “When a person has offended you, and comes to offer
his apologies, what should you do?”
28. Reverses the hands of a clock. 


29. After paper folding and cutting, draws the form of the resulting holes. 
30. Defines abstract words by designating the difference between, e.g., “boredom and weariness.” 





Source: Based on translations in Jenkins and Paterson (1961) and Jensen (1980).

Except for the very simplest tests, which were
designed for the classification of very low-grade id- 
iots (an unfortunate diagnostic term that has since 
been dropped), the tests were heavily weighted to- 
ward verbal skills, reflecting Binet’s departure 
from the Galtonian tradition. 

An interesting point that is often overlooked by 
contemporary students of psychology is that Binet 
and Simon did not offer a precise method for arriv- 
ing at a total score on their 1905 scale. It is well to 
remember that their purpose was classification, not 
measurement, and that their motivation was en- 
tirely humanitarian, namely, to identify those chil- 
dren who needed special educational placement. 
By contemporary standards, it is difficult to accept 
the fuzziness inherent in such an approach, but that 
may reflect a modern penchant for quantification 
more than a weakness in the 1905 scale. In fact, 
their scale was popular among educators in Paris. 
And, even with the absence of precise quantifica- 
tion, the approach was successful in selecting can- 
didates for special classes. 


THE REVISED SCALES 
AND THE ADVENT OF IQ 


In 1908, Binet and Simon published a revision of 
the 1905 scale. In the earlier scale, more than half 
the items had been designed for the very retarded, 
yet the major diagnostic decisions involved older 
children and those with borderline intellect. To rem- 
edy this imbalance, most of the very simple items 
were dropped and new items were added at the 
higher end of the scale. The 1908 scale had 58 prob- 
lems or tests, almost double the number from 1905. 
Several new tests were added, many of which are 
still used today: reconstructing scrambled sen- 
tences, copying a diamond, and executing a se- 
quence of three commands. Some of the items were 
absurdities that the children had to detect and ex- 
plain. One such item was amusing to French chil- 
dren: “The body of an unfortunate girl was found, 
cut into 18 pieces. It is thought that she killed her- 
self.” However, this item was very upsetting to some 
American subjects, demonstrating the importance 
of cultural factors in intelligence (Fancher, 1985). 


The major innovation of the 1908 scale was the 
introduction of the concept of mental level. The tests 
had been standardized on about 300 normal children 
between the ages of 3 and 13 years. This allowed 
Binet and Simon to order the tests according to the 
age level at which they were typically passed. 
Whichever items were passed by 80 to 90 percent 
of the 3-year-olds were placed in the 3-year level, 
and similarly on up to age 13. Binet and Simon also 
devised a rough scoring system whereby a basal age 
was first determined from the age level at which not 
more than one test was failed. For each five tests that 
were passed at levels above the basal, a full year of 
mental level was granted. Insofar as partial years of 
mental level were not credited and the various age 
levels had anywhere from three to eight tests, the 
method left much to be desired. 
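
The 1908 scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mental_level_1908` and the dictionary layout are hypothetical, and the basal age is taken to be the highest age level at which no more than one test was failed, which is one reading of the description.

```python
def mental_level_1908(results):
    """Estimate mental level from per-age-level results.

    `results` maps an age level (int) to a list of booleans,
    one per test at that level (True = passed).
    """
    # Assumption: the basal age is the highest age level at which
    # no more than one test was failed.
    basal_candidates = [age for age, tests in results.items()
                        if tests.count(False) <= 1]
    if not basal_candidates:
        raise ValueError("no age level qualifies as a basal age")
    basal = max(basal_candidates)

    # Count the tests passed at levels above the basal age.
    passed_above = sum(tests.count(True)
                       for age, tests in results.items() if age > basal)

    # One full year per five tests passed; partial years earn no credit.
    return basal + passed_above // 5


# Invented results for one child (age levels held three to eight tests).
child = {
    6: [True, True, True, True],                 # no failures
    7: [True, True, False, True, True],          # one failure, so basal age 7
    8: [True, False, False, True],               # two passes above the basal
    9: [True, False, True, False, False, True],  # three passes above the basal
    10: [False, False, False],
}
print(mental_level_1908(child))  # 7 + (5 // 5) = 8
```

Because partial years are dropped, a child who passed nine tests above the basal level would receive the same single year of credit as one who passed five, which is one reason the method left much to be desired.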

In 1911, a third revision of the Binet-Simon 
scales appeared. Each age level now had exactly 
five tests. The scale was also extended into the 
adult range. And with some reluctance, Binet in- 
troduced new scoring methods that allowed for 
one-fifth of a year for each subtest passed beyond 
the basal level. In his writings, Binet emphasized
strongly that the child’s exact mental level should 
not be taken too seriously as an absolute measure 
of intelligence. 
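
As a rough illustration of the 1911 scoring change, the sketch below credits one-fifth of a year for each test passed beyond the basal level. The function name and its arguments are hypothetical, and exact fractions are used only to keep the arithmetic free of rounding error.

```python
from fractions import Fraction

TESTS_PER_LEVEL = 5  # each age level in the 1911 revision had exactly five tests


def mental_level_1911(basal_age, passed_beyond_basal):
    """Basal age plus one-fifth of a year per test passed beyond it."""
    if passed_beyond_basal < 0:
        raise ValueError("number of tests passed cannot be negative")
    return basal_age + Fraction(passed_beyond_basal, TESTS_PER_LEVEL)


# Invented example: basal age of 7 with seven tests passed beyond it.
level = mental_level_1911(basal_age=7, passed_beyond_basal=7)
print(float(level))  # 7 + 7/5 = 8.4
```

The decimal precision is a by-product of the arithmetic. As the text notes, Binet cautioned against treating the exact mental level as an absolute measure of intelligence.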

Nonetheless, the idea of deriving a mental level 
was a monumental development that was to influ- 
ence the character of intelligence testing throughout 
the twentieth century. Within months, what Binet 
called mental level was being translated as mental 
age. And testers everywhere, including Binet him- 
self, were comparing a child’s mental age with the 
child’s chronological age. Thus, a 9-year-old who 
was functioning at the mental level (or mental age) 
of a 6-year-old was retarded by three years. Very 
shortly, Stern (1912) pointed out that being retarded 
by three years had different meanings at different 
ages. A 5-year-old functioning at the 2-year-old 
level was more impaired than a 13-year-old func- 
tioning at the 10-year-old level. Stern suggested that 
an intelligence quotient computed from the mental 
age divided by the chronological age would give a 
better measure of the relative functioning of a subject
compared to his or her same-aged peers.

In 1916, Terman and his associates at Stanford
revised the Binet-Simon scales, producing the
Stanford-Binet, a successful test that is discussed
in a later chapter. Terman suggested multiplying the
intelligence quotient by 100 to remove fractions; he
was also the first person to use the abbreviation IQ.
Thus was born one of the most popular and con- 
troversial concepts in the history of psychology. 
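
Stern's quotient and Terman's rescaling amount to simple arithmetic, sketched here in Python. The function name `ratio_iq` is hypothetical; the ages are taken from the examples in the text.

```python
def ratio_iq(mental_age, chronological_age):
    """Mental age divided by chronological age, multiplied by 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age / chronological_age


# Both children are three years behind, yet the quotients differ sharply.
print(round(ratio_iq(2, 5)))    # 5-year-old at the 2-year level: 40
print(round(ratio_iq(10, 13)))  # 13-year-old at the 10-year level: 77
```

The quotient makes Stern's point concrete: the same three-year lag yields a much lower value for the younger child.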

Binet died in 1911 before the IQ swept American
testing, so we will never know what he would have 
thought of this new development based on his 
scales. However, Simon, his collaborator, later 
called the concept of IQ a “betrayal” of their scale’s 
original objectives (Fancher, 1985, p. 104), and we 
can assume from Binet’s humanistic concern that 
he might have held a similar opinion. 


SUMMARY 


1. For better or for worse, psychological test 
results possess the power to alter lives. A review of 
historical trends is crucial if we desire to compre- 
hend the contemporary influence of psychological 
tests. 


2. Rudimentary forms of testing date back to 
2200 B.C. in China. The Chinese emperors used
grueling written exams to select officials for civil 
service. 


3. In the mid- to late 1800s, several physi- 
cians and psychiatrists developed standardized pro- 
cedures to reveal the nature and extent of symptoms 
in the mentally ill and brain-injured. For example, 
in 1885, Hubert von Grashey developed the pre- 
cursor to the memory drum to test the visual recog- 
nition skill of brain-injured patients. 


4. Modern psychological testing owes its in- 
ception to the era of brass instruments psychology 
that flourished in Europe during the late 1800s. By 
testing sensory thresholds and reaction times, pio- 
neer test developers such as Sir Francis Galton 
demonstrated that it was possible to measure the 
mind in an objective and replicable manner. 


5. Wilhelm Wundt founded the first psycho- 
logical laboratory in 1879 in Leipzig, Germany. In- 
cluded among his earlier investigations was his 
1862 attempt to measure the speed of thought with 
the thought meter, a calibrated pendulum with nee- 
dles sticking off from each side. 


6. The first reference to mental tests occurred 
in 1890 in a classic paper by James McKeen Cat- 
tell, an American psychologist who had studied 
with Galton. Cattell imported the brass instruments 
approach to the United States. 


7. One of Cattell’s students, Clark Wissler, 
showed that reaction time and sensory discrimina- 
tion measures did not correlate with college grades, 
thereby redirecting the mental-testing movement 
away from brass instruments. 


8. In the late 1800s, a newfound humanism 
toward the mentally retarded, reflected in the diag- 
nostic and remedial work of French physicians Es- 
quirol and Seguin, helped create the necessity for 
early intelligence tests. 


9. Alfred Binet, who was to invent the first 
true intelligence test, began his career by studying 
hysterical paralysis with the French neurologist 
Charcot. Binet’s claim that magnetism could cure 
hysteria was, to his pained embarrassment, dis- 
proved. Shortly thereafter, he switched interests 
and conducted sensory-perceptual studies, using 
his children as subjects. 


10. In 1905, Binet and Simon developed the 
first useful intelligence test in Paris, France. Their 
simple 30-item measure of mainly higher mental 
functions helped identify schoolchildren who could 
not profit from regular instruction. Curiously, there 
was no method for scoring the test. 


11. In 1908, Binet and Simon published a re- 
vised 58-item scale that incorporated the concept of 
mental level. In 1911, a third revision of the Binet- 
Simon scales appeared. Each age level now had ex- 
actly five tests; the scale extended into the adult range. 


12. In 1912, Stern proposed dividing the 
mental age by the chronological age to obtain an 
intelligence quotient. In 1916, Terman suggested 
multiplying the intelligence quotient by 100 to remove
fractions. Thus was born the concept of IQ.


Early Uses and Abuses of Tests in the United States

The Invention of Nonverbal Tests in the Early 1900s 

The Stanford-Binet: The Early Mainstay of IQ 

Group Tests and the Classification of WWI Army Recruits 


Early Educational Testing 


The Development of Aptitude Tests 

Personality and Vocational Testing After WWI 

The Origins of Projective Testing 

The Development of Interest Inventories 

Summary of Major Landmarks in the History of Testing 


Summary 


The Binet-Simon scales helped solve a practi-
cal social quandary, namely, how to identify 
children who needed special schooling. With this 
successful application of a mental test, psycholo- 
gists realized that their inventions could have prag- 
matic significance for many different segments of 
society. Almost immediately, psychologists in the 
United States adopted a utilitarian focus. Intelli- 
gence testing was embraced by many as a reliable 
and objective response to perceived social prob- 
lems such as the identification of immigrants with 
mental retardation and the quick, accurate classifi- 
cation of Army recruits (Boake, 2002). 

Whether these early tests really solved social 
dilemmas—or merely exacerbated them—is a 
fiercely debated issue reviewed in the following 
sections. One thing is certain: The profusion of 
tests developed early in the twentieth century 
helped shape the character of contemporary tests. 
A review of these historical trends will aid in the 
comprehension of the nature of modern tests and a 
better appreciation of the social issues raised by 
them.


EARLY USES AND ABUSES
OF TESTS IN THE UNITED STATES 


First Translation of the Binet-Simon Scale 


In 1906, Henry H. Goddard was hired by the 
Vineland Training School in New Jersey to do 
research on the classification and education of 
“feebleminded” children. He soon realized that a 
diagnostic instrument would be required and was 
therefore pleased to read of the 1908 Binet-Simon 
scale. He quickly set about translating the scale, 
making minor changes so that it would be applica- 
ble to American children (Goddard, 1910a). 
Goddard (1910b) tested 378 residents of the 
Vineland facility and categorized them by diagno- 
sis and mental age. He classified 73 residents as id- 
iots because their mental age was 2 years or lower; 
205 residents were termed imbeciles with mental 
age of 3 to 7 years; and 100 residents were deemed 
feebleminded with mental age of 8 to 12 years. It is 
instructive to note that originally neutral and
descriptive terms for portraying levels of mental
retardation (idiot, imbecile, and feebleminded)
have made their way into the everyday lexicon of
pejorative labels. In fact, Goddard made his own
contribution by coining the diagnostic term moron
(from the Greek moronia, meaning “foolish”).

Goddard (1911) also tested 1,547 normal children
with his translation of the Binet-Simon scales.
He considered children whose mental age was four 
or more years behind their chronological age to be 
feebleminded—these constituted 3 percent of his 
sample. Considering that all of these children were 
found outside of institutions for the retarded, 3 per- 
cent is rather an alarming rate of mental deficiency. 
Goddard (1911) was of the opinion that these chil- 
dren should be segregated so that they would be 
prevented from “contaminating society.” These 
early studies piqued Goddard’s curiosity about 
“feebleminded” citizenry and the societal burdens 
they imposed. He also gained a reputation as one of 
the leading experts on the use of intelligence tests 
to identify persons with impaired intellect. His tal- 
ents were soon in heavy demand. 


The Binet-Simon and Immigration 


In 1910, Goddard was invited to Ellis Island by the 
commissioner of immigration to help make the ex- 
amination of immigrants more accurate. A dark and 
foreboding folklore had grown up around mental 
deficiency and immigration in the early 1900s: 


It was believed that the feebleminded were degen- 
erate beings responsible for many if not most social 
problems; that they reproduced at an alarming rate 
and menaced the nation’s overall biological fitness; 
and that their numbers were being incremented by 
undesirable “new” immigrants from southern and 
eastern European countries who had largely sup- 
planted the “old” immigrants from northern and 
western Europe. (Gelb, 1986) 


Initially, Goddard was unconcerned about the 
supposed threat of feeblemindedness posed by the 
immigrants. He wrote that adequate statistics did 
not exist and that the prevalent opinions about 
undue percentages of mentally defective immi- 
grants were “grossly overestimated” (Goddard, 
1912). However, with repeated visits to Ellis Island, 
Goddard became convinced that the rates of feeblemindedness
were much higher than estimated by
the physicians who staffed the immigration service. 
Within a year, he reversed his opinions entirely and 
called for congressional funding so that Ellis Island 
could be staffed with experts trained in the use of 
intelligence tests. In the following decade, Goddard 
became an apostle for the use of intelligence tests 
to identify feebleminded immigrants. Although he 
wrote that the rates of mentally deficient immi- 
grants were “alarming,” he did not join the popular 
call for immigration restriction (Gelb, 1986). 

The story of Goddard and his concern for the 
“menace of feeblemindedness,” as Gould (1981) 
has satirically put it, is often ignored or downplayed 
in books on psychological testing. The majority of 
textbooks on testing do not mention or refer to God- 
dard at all. The few books that do mention him usu- 
ally state that Goddard “used the tests in institutions 
for the retarded,” which is surely an understatement. 
In his influential History of Psychological Testing, 
DuBois (1970) has a portrait of Goddard but devotes 
less than one line of text to him. 

The fact is that Goddard was one of the most in- 
fluential American psychologists of the early 
1900s. Any thoughtful person must therefore won- 
der why so many contemporary authors have ig- 
nored or slighted the person who first translated and 
applied Binet’s tests in the United States. We will 
attempt an answer here, based in part on Goddard’s 
original writing, but also relying upon Gould’s 
(1981) critique of Goddard’s voluminous writings 
on mental deficiency and intelligence testing. We 
refer to Gelb’s (1986) more sympathetic portrayal 
of Goddard as well. 

Perhaps Goddard has been ignored in the text- 
books because he was a strict hereditarian who con- 
ceived of intelligence in simple-minded Mendelian 
terms. No doubt his call for colonization of “mo- 
rons” so as to restrict their breeding has won him 
contemporary disfavor as well. And his insistence 
that much undesirable behavior—crime, alco- 
holism, prostitution—was due to inherited mental 
deficiency also does not sit well with the modern 
environmentalist position. 

However, the most likely reason that modern 
authors have ignored Goddard is that he exemplified
a large number of early, prominent psychologists
who engaged in the blatant misuse of intelligence
testing. In his efforts to demonstrate that
high rates of immigrants with mental retardation 
were entering the United States each day, Goddard 
sent his assistants to Ellis Island to administer his 
English translation of the Binet-Simon tests to 
newly arrived immigrants. The tests were adminis- 
tered through a translator, not long after the immi- 
grants walked ashore. We can guess that many of 
the immigrants were frightened, confused, and dis- 
oriented. Thus, a test devised in French, then trans- 
lated to English was, in turn, retranslated back to 
Yiddish, Hungarian, Italian, or Russian; adminis- 
tered to bewildered farmers and laborers who had 
just endured an Atlantic crossing; and interpreted 
according to the original French norms. 

What did Goddard find and what did he make of 
his results? In small samples of immigrants (22 to 
50), his assistants found 83 percent of the Jews, 80 
percent of the Hungarians, 79 percent of the Italians, 
and 87 percent of the Russians to be feebleminded, 
that is, below age 12 on the Binet-Simon scales 
(Goddard, 1917). His interpretation of these findings 
is, by turns, skeptically cautious and then provoca- 
tively alarmist. In one place he claims that his study 
“makes no determination of the actual percentage, 
even of these groups, who are feebleminded.” Yet, 
later in the report he states that his figures would 
only need to be revised by “a relatively small 
amount” in order to find the actual percentages of 
feeblemindedness among immigrant groups. Fur- 
ther, he concludes that the intelligence of the aver- 
age immigrant is low, “perhaps of moron grade,” but 
then goes on to cite environmental deprivation as the 
primary culprit. Simultaneously, Goddard appears 
to favor deportation for low IQ immigrants but also 
provides the humanitarian perspective that we might 
be able to use “moron laborers” if only “we are wise 
enough to train them properly.” 

There is much, much more to the Goddard era 
of early intelligence testing, and the interested 
reader is urged to consult Gould (1981) and Gelb 
(1986). The most important point that we wish to 
stress here is that, like those of many other early
psychologists, Goddard’s scholarly views were influenced
by the social ideologies of his time. Finally,
Goddard was a complex scholar who refined and 
contradicted his professional opinions on numer- 
ous occasions. One ironic example: After the dam- 
age was done and his writings had helped restrict 
immigration, Goddard (1928) recanted, concluding 
that feeblemindedness was not incurable, and that 
the feebleminded did not need to be segregated in 
institutions. 

The Goddard chapter in the history of testing 
serves as a reminder that even well-meaning per- 
sons operating within generally accepted social 
norms can misuse psychological tests. We need 
be ever mindful that disinterested “science” can 
be harnessed to the goals of a pernicious social 
ideology. 


THE INVENTION OF NONVERBAL TESTS
IN THE EARLY 1900s





Because of the heavy emphasis of the Binet-Simon 
scales upon verbal skills, many psychologists real- 
ized that this new measuring device was not entirely 
appropriate for non-English-speaking subjects, 
illiterates, and those with speech and hearing 
impairments. A spate of performance scales there- 
fore arose in the decade following Goddard’s 
1908 translation of the Binet-Simon. Only a brief 
chronology of nonverbal tests will be supplied here. 
The interested reader should consult DuBois 
(1970). In this listing of early performance tests, the 
reader will surely recognize many instruments and 
subtests that are still used today. 

The earliest of the performance measures was 
the Seguin form board, an upright stand with de- 
pressions into which ten blocks of varying shapes 
could be fitted. This had been used by Seguin as a 
training device for individuals with mental retarda- 
tion, but was subsequently developed as a test by 
Goddard, and then standardized by R. H. Sylvester 
(1913). This identical board is still used, with the 
subject blindfolded, in the Halstead-Reitan neu- 
ropsychological test battery (Reitan & Wolfson, 
1985). 

Knox (1914) devised several performance tests 
for use with Ellis Island immigrants. His tests
required absolutely no verbal responses from subjects.
The examiner demonstrated each task nonverbally
to ensure that the subjects understood the
instructions. Included in his tests were a simple 
wooden puzzle (which Knox referred to as the 
“moron” test) and the same digit-symbol substitu- 
tion test which is now found on most of the Wech- 
sler scales of intelligence. 

Several other early performance tests are 
worthy of brief mention because they have sur- 
vived to the present day in revised form. Pintner 
and Paterson (1917) invented a 15-part scale of 
performance tests that used several form boards, 
puzzles, and object assembly tests. The object as- 
sembly test—reassembling cut-up cardboard ver- 
sions of common objects such as a horse—is a 
mainstay of several contemporary intelligence 
tests. The Kohs Block Design test (Kohs, 1920), 
which required the subject to assemble painted 
blocks to resemble a pattern, is well known to any 
modern tester who uses the Wechsler scales. The 
Porteus Maze Test (Porteus, 1915) is a graded se- 
ries of mazes for which the subject must avoid 
dead ends while tracing a path from beginning to 
end. This is a fine instrument that is still available 
today, but underused. 


THE STANFORD-BINET: 
THE EARLY MAINSTAY OF IQ 


While it was Goddard who first translated the Binet 
scales in the United States, it was Stanford professor
Lewis M. Terman (1877-1956) who popularized
IQ testing with his revision of the Binet
scales in 1916. The new Stanford-Binet, as it was 
called, was a substantial revision, not just an ex- 
tension, of the earlier Binet scales. Among the 
many changes that led to the unquestioned prestige 
of the Stanford-Binet was the use of the now fa- 
miliar IQ for expressing test results. The number of 
items was increased to 90, and the new scale was 
suitable for those with mental retardation, children, 
and both normal and “superior” adults. In addition, 
the Stanford-Binet had clear and well-organized
instructions for administration and scoring. Great
care had been taken in securing a representative
sample of subjects for use in the standardization of 
the test. As Goodenough (1949) notes: “The publi- 
cation of the Stanford Revision marked the end of 
the initial period of experimentation and uncer- 
tainty. Once and for all, intelligence testing had 
been put on a firm basis.” 

The Stanford-Binet was the standard of intelli- 
gence testing for decades. New tests were always 
validated in terms of their correlations with this 
measure. It continued its preeminence through
revisions in 1937 and 1960, by which time the
Wechsler scales (Wechsler, 1949, 1955) had begun to
compete with it. The latest revision of the Stanford- 
Binet was completed in 2003. This test and the 
Wechsler scales are discussed in detail in a later 
chapter. It is worth mentioning here that the Wech- 
sler scales became a quite popular alternative to the 
Stanford-Binet mainly because they provided more 
than just an IQ score. In addition to Full Scale IQ, 
the Wechsler scales provided ten to twelve subtest 
scores, and a Verbal and Performance IQ. By con- 
trast, the earlier versions of the Stanford-Binet sup- 
plied only a single overall summary score, the 
global IQ. 


GROUP TESTS AND THE CLASSIFICATION
OF WWI ARMY RECRUITS


Given the American penchant for efficiency, it was 
only natural that researchers would seek group men- 
tal tests to supplement the relatively time-consum- 
ing individual intelligence tests imported from 
France. Among the first to develop group tests was 
Pyle (1913), who published schoolchildren norms 
for a battery consisting of such well-worn measures 
as memory span, digit-symbol substitution, and oral 
word association (quickly writing down words in re- 
sponse to a stimulus word). Pintner (1917) revised 
and expanded Pyle’s battery, adding to it a timed 
cancellation test in which the child crossed out the 
letter a wherever it appeared in a body of text. 

But group tests were slow to catch on, partly 
because the early versions still had to be scored
laboriously by hand. The idea of a completely objective
test with a simple scoring key was inconsistent
with tests such as logical memory for which
the judgment of the examiner was required in scor- 
ing. Most amazing of all—at least to anyone who 
has spent any time as a student in American 
schools—the multiple-choice question was not yet 
in general use. 

The slow pace of developments in group test- 
ing picked up dramatically as the United States en- 
tered World War I in 1917. It was then that Robert 
M. Yerkes, a well-known psychology professor at 
Harvard, convinced the U.S. government and the 
Army that all of its 1.75 million recruits should be 
given intelligence tests for purposes of classifica- 
tion and assignment (Yerkes, 1919). Immediately 
upon being commissioned into the Army as a 
colonel, Yerkes assembled a Committee on the Ex- 
amination of Recruits, which met at the Vineland 
school in New Jersey to develop the new group tests 
for the assessment of Army recruits. Yerkes chaired 
the committee; other famous members included 
Goddard and Terman. 

Two group tests emerged from this collabora- 
tion: the Army Alpha and the Army Beta. It would 
be difficult to overestimate the influence of the 
Alpha and Beta upon subsequent intelligence tests. 
The format and content of these tests inspired de- 
velopments in group and individual testing for 
decades to come. We discuss these tests in some de- 
tail so that the reader can appreciate their influence 
on modern intelligence tests. 


The Army Alpha and Beta Examinations 


The Alpha was based on the then unpublished work 
of Otis (1918) and consisted of eight verbally loaded 
tests for average and high-functioning recruits. The 
eight tests were: (1) following oral directions, 
(2) arithmetical reasoning, (3) practical judgment, 
(4) synonym-antonym pairs, (5) disarranged sen- 
tences, (6) number series completion, (7) analogies, 
and (8) information. Figure 1.1 lists some typical 
items from the Army Alpha examination. 

The Army Beta was a nonverbal group test designed
for use with illiterates and recruits whose
first language was not English. It consisted of various
visual-perceptual and motor tests such as
tracing a path through mazes and visualizing the 
correct number of blocks depicted in a three- 
dimensional drawing. Figure 1.2 depicts the black- 
board demonstrations for all eight parts of the Beta 
examination. 

In order to accommodate illiterate subjects and 
recent immigrants who did not comprehend Eng- 
lish, Yerkes instructed the examiners to use largely 
pictorial and gestural methods for explaining the 
tests to the prospective Army recruits. The exam- 
iner and an assistant stood atop a platform at the 
front of the class and engaged in pantomime to ex- 
plain each of the eight tests. We reproduce here the 
exact instructions for one test so that the reader can 
appraise the likely effects of the testing procedures 
upon Beta results. Keep in mind that many recruits
could not see or hear the examiner well, and that 
some had never taken a test before. Here is how the 
examiners introduced test 6, picture completion, to 
each new roomful of potential recruits: 


“This is test 6 here. Look. A lot of pictures.” After 
everyone has found the place, “Now watch.” Exam- 
iner points to hand and says to demonstrator, “Fix 
it.” Demonstrator does nothing, but looks puzzled. 
Examiner points to the picture of the hand, and 
then to the place where the finger is missing and 
says to demonstrator, “Fix it; fix it.” Demonstrator 
then draws in finger. Examiner says “That’s right.” 
Examiner then points to fish and place for eye and 
says, “Fix it.” After demonstrator has drawn miss- 
ing eye, examiner points to each of the four re- 
maining drawings and says, “Fix them all.” 
Demonstrator works samples out slowly and with 
apparent effort. When the samples are finished examiner
says, “All right. Go ahead. Hurry up!” During
the course of this test the orderlies walk around
the room and locate individuals who are doing 
nothing, point to their pages and say, “Fix it. Fix 
them,” trying to set everyone working. At the end 
of 3 minutes examiner says, “Stop! But don’t turn 
over the page.” (Yerkes, 1921) 


FOLLOWING ORAL DIRECTIONS
Mark a cross in the first and also the third circle:
    O  O  O  O  O

ARITHMETICAL REASONING
Solve each problem:
    How many men are 5 men and 10 men?                            Answer ( )
    If 3 1/2 tons of coal cost $21, what will 5 1/2 tons cost?    Answer ( )

PRACTICAL JUDGMENT
Why are high mountains covered with snow? Because
    [ ] they are near the clouds
    [ ] the sun shines seldom on them
    [ ] the air is cold there

SYNONYM-ANTONYM PAIRS
Are these words the same or opposite?
    largess—donation          same? or opposite?
    accumulate—dissipate      same? or opposite?

DISARRANGED SENTENCES
Can these words be rearranged to form a sentence?
    envy bad malice traits are and      true? or false?

NUMBER SERIES COMPLETION
Complete the series:
    3  6  8  16  18  36  __  __

ANALOGIES
Which choice completes the analogy?
    tears—sorrow :: laughter—      joy  smile  girls  grin
    granary—wheat :: library—      desk  books  paper  librarian

INFORMATION
Choose the best alternative:
    The pancreas is in the                       abdomen  head  shoulder  neck
    The Battle of Gettysburg was fought in       1863  1813  1778  1812

Note: Examinees received verbal instructions for each subtest.

FIGURE 1.1 Sample Items from the Army Alpha Examination
Source: Reprinted from Yerkes, R. M. (Ed.). (1921). Psychological examining in the United States Army.
Memoirs of the National Academy of Sciences, Volume 15. With permission from the National Academy
of Sciences, Washington, DC.


The Army testing was intended to help segregate
and eliminate the mentally incompetent, to
classify men according to their mental ability, and
to assist in the placement of competent men in
responsible positions (Yerkes, 1921). However, it
is not really clear whether the Army made much
use of the masses of data supplied by Yerkes and
his eager assistants. A careful reading of his
memoirs reveals that Yerkes did little more than
produce favorable testimonials from high-ranking
officers. In the main, his memoirs say that the
Army could have saved millions of dollars and increased
its efficiency, if the testing data had been
used.

FIGURE 1.2 The Blackboard Demonstrations for All Eight Parts of the Beta Examination
Source: Reprinted from Yerkes, R. M. (Ed.). (1921). Psychological examining in the United States Army.
Memoirs of the National Academy of Sciences, Volume 15. With permission from the National Academy
of Sciences, Washington, DC.

To some extent, the mountains of test data had
little practical impact on the efficiency of the Army
because of the resistance of the military mind to
scientific innovation. However, it is also true that
the Army brass had good reason to doubt the validity
of the test results. For example, an internal
memorandum described the use of pantomime in
the instructions to the nonverbal Beta examination:


For the sake of making results from the various
camps comparable, the examiners were ordered to 
follow a certain detailed and specific series of bal- 
let antics, which had not only the merit of being 
perfectly incomprehensible and unrelated to mental 
testing, but also lent a highly confusing and dis- 
tracting mystical atmosphere to the whole perfor- 
mance, effectually preventing all approach to the 
attitude in which a subject should be while having 
his soul tested. (cited in Samelson, 1977) 


In addition, the testing conditions left much to be 
desired, with wave upon wave of recruits ushered 
in one door, tested, and virtually shoved out the 
other side. Tens of thousands of recruits received a 
literal zero for many subtests, not because they 
were retarded but because they couldn’t fathom the 
instructions to these enigmatic new instruments. 
Many recruits fell asleep while the testers gave es- 
oteric and mysterious pantomime instructions. 

On the positive side, the Army testing provided 
psychologists with a tremendous amount of expe- 
rience in the psychometrics of test construction. 
Thousands of correlation coefficients were com- 
puted, including the prominent use of multiple 
correlations in the analysis of test data. Test con- 
struction graduated from an art to a science in a few 
short years. 


The Army Tests and Ethnic Differences 


Unfortunately, the Army test results were some- 
times used to substantiate prejudices about various 
racial and ethnic groups rather than to dispassion- 
ately investigate the causes of group differences. 
For example, in his influential book A Study of 
American Intelligence, Brigham (1923) undertook 
a massive analysis of Alpha and Beta scores for 
Nordic, Mediterranean, and Alpine immigrants. 
The text is stuffed with ostensibly objective tables 
and charts comparing racial and ethnic groups. For 
example, one curious figure in his book depicts the 
proportion of each immigration sample at or below 
the average of the African American draft. Brigham 
concluded that African Americans, Mediterranean 
immigrants, and Alpine immigrants were intellectually
inferior. He sounded a dire warning that
racial intermixture would inevitably cause a deterioration
of American intelligence. For example, the
caption to one graph reads, in part:


The distributions of the intelligence scores of the 
entire Nordic group, the combined Mediterranean 
and Alpine groups, and the negro draft. The 
process of racial intermixture cannot result in any- 
thing but an average of these elements, with the re- 
sulting deterioration of American intelligence. 
(Brigham, 1923) 


Seven years later, Brigham (1930) forthrightly 
disavowed his earlier views. He cited cultural and 
language differences as the likely cause of ethnic 
and racial disparities on the Army tests. He asserted 
that comparative studies of national and racial 
groups could not be made with existing tests and 
concluded that his earlier findings were “without 
foundation” (Brigham, 1930). 


EARLY EDUCATIONAL TESTING


For good or for ill, Yerkes’s grand scheme for test- 
ing Army recruits helped to usher in the era of 
group tests. After WWI, inquiries rushed in from 
industry, public schools, and colleges about the po- 
tential applications of these straightforward tests 
that almost anyone could administer and score 
(Yerkes, 1921). The psychologists who had worked 
with Yerkes soon left the service and carried with 
them to industry and education their newfound no- 
tion of paper-and-pencil tests of intelligence. 

The Army Alpha and Beta were also released 
for general use. These tests quickly became the 
prototypes for a large family of group tests and 
influenced the character of intelligence tests, col- 
lege entrance examinations, scholastic achievement 
tests, and aptitude tests. To cite just one specific 
consequence of the Army testing, the National 
Research Council, a government organization of 
scientists, devised the National Intelligence Test, 
which was eventually given to 7 million children 
in the United States during the 1920s. Thus, such 
well-known tests as the Wechsler scales, the Scho- 
lastic Aptitude Tests, and the Graduate Record 
Exam actually have roots that reach back to Yerkes,
Otis, and the mass testing of Army recruits during
WWI.

The College Entrance Examination Board 
(CEEB) was established at the turn of the twentieth 
century to help avoid duplication in the testing of 
applicants to U.S. colleges. The early exams had 
been of the short answer essay format, but this was 
to change quickly when C. C. Brigham, a disciple 
of Yerkes, became CEEB secretary after WWI. In 
1925, the College Board decided to construct a 
scholastic aptitude test for use in college admis- 
sions (Goslin, 1963). The new tests reflected the 
now familiar objective format of unscrambling sen- 
tences, completing analogies, and filling in the next 
number in a sequence. Machine scoring was intro- 
duced in the 1930s, making objective group tests 
even more efficient than before. These tests then 
evolved into the present College Board tests, in par- 
ticular, the Scholastic Aptitude Tests, now known 
as the Scholastic Assessment Tests. 

The functions of the CEEB were later sub- 
sumed under the nonprofit Educational Testing Ser- 
vice (ETS). The ETS directed the development, 
standardization, and validation of such well-known 
tests as the Graduate Record Examination, the 
Law School Admissions Test, and the Peace Corps 
Entrance Tests. 

Meanwhile, Terman and his associates at Stan- 
ford were busy developing standardized achieve- 
ment tests. The Stanford Achievement Test (SAchT) 
was first published in 1923; a modern version of it 
is still in wide use today. From the very beginning, 
the SAchT incorporated such modern psychomet- 
ric principles as norming the subtests so that 
within-subject variability could be assessed and se- 
lecting a very large and representative standardiza- 
tion sample. 


THE DEVELOPMENT 
OF APTITUDE TESTS 


Aptitude tests measure more specific and delimited 
abilities than intelligence tests. Traditionally, intel- 
ligence tests assess a more global construct such as 
general intelligence, although there are exceptions 
to this trend that will be discussed later. By contrast,
a single aptitude test will measure just one
ability domain, and a multiple aptitude test battery 
will provide scores in several distinctive ability 
areas. 

The development of aptitude tests lagged be- 
hind that of intelligence tests for two reasons, one 
statistical, the other social. The statistical problem 
was that a new technique, factor analysis, was 
often needed to discern which aptitudes were pri- 
mary and therefore distinct from each other. Re- 
search on this question had been started quite early 
by Spearman (1904) but was not refined until the 
1930s (Spearman, 1927; Kelley, 1928; Thurstone, 
1938). This new family of techniques, factor 
analysis, allowed Thurstone to conclude that there 
were specific factors of primary mental ability 
such as verbal comprehension, word fluency, num- 
ber facility, spatial ability, associative memory, 
perceptual speed, and general reasoning (Thur- 
stone, 1938; Thurstone & Thurstone, 1941). More 
will be said about this in the later chapters on in- 
telligence and ability testing. The important point 
here is that Thurstone and his followers thought 
that global measures of intelligence did not, so to 
speak, “cut nature at its joints.” As a result, it was 
felt that such measures as the Stanford-Binet were 
not as useful as multiple aptitude test batteries in
determining a person’s intellectual strengths and 
weaknesses. 

The second reason for the slow growth of apti- 
tude batteries was the absence of a practical appli- 
cation for such refined instruments. It was not until 
WWII that a pressing need arose to select candidates
who were highly qualified for very difficult
and specialized tasks. The job requirements of 
pilots, flight engineers, and navigators were very 
specific and demanding. A general estimate of in- 
tellectual ability, such as provided by the group 
intelligence tests used in WWI, was not sufficient 
to choose good candidates for flight school. The 
armed forces solved this problem by developing a 
specialized aptitude battery of 20 tests that was 
administered to men who passed preliminary 
screening tests. These measures proved invaluable 
in selecting pilots, navigators, and bombardiers, as
reflected in the much lower washout rates of men
selected by test battery instead of the old methods
(Goslin, 1963). Such tests are still used widely in 
the armed services. 


PERSONALITY AND VOCATIONAL
TESTING AFTER WWI


While such rudimentary assessment methods as 
the free association technique had been used be- 
fore the turn of the twentieth century by Galton, 
Kraepelin, and others, it was not until WWI that 
personality tests emerged in a form resembling 
their contemporary appearance. As has happened 
so often in the history of testing, it was once again 
a practical need that served as the impetus for this 
new development. Modern personality testing 
began when Woodworth attempted to develop an 
instrument for detecting Army recruits who were 
susceptible to psychoneurosis. Virtually all the 
modern personality inventories, schedules, and
questionnaires owe a debt to Woodworth’s Per- 
sonal Data Sheet (1919). 

The Personal Data Sheet consisted of 116 ques- 
tions that the subject was to answer by underlining 
Yes or No. The questions were exclusively of the 
“face obvious” variety and, for the most part, in- 
volved fairly serious symptomatology. Representa- 
tive items included: 


• Do ideas run through your head so that you cannot sleep?

• Were you considered a bad boy?

• Are you bothered by a feeling that things are not real?

• Do you have a strong desire to commit suicide?


Readers familiar with the Minnesota Multiphasic 
Personality Inventory (MMPI) must surely recog- 
nize the debt that this more recent inventory has to 
Woodworth’s instrument. 

From his account of how the Personal Data 
Sheet was developed (Woodworth, 1951), it is clear 
that Woodworth took great care in the selection of 
items. In other respects, though, this instrument 
embodies a large dose of psychometric credulity. 
The most serious problem is simply that a disturbed 
subject motivated to look good could do so without
detection; likewise, a normal subject with a fake
bad mentality might be categorized as unfit for ser- 
vice. Modern instruments such as the MMPI have 
incorporated various validity scales for detecting 
such response tendencies. The Personal Data Sheet, 
by contrast, was predicated on the assumption that 
subjects would be honest when responding to the 
questions. 

The next major development was an inventory 
of neurosis, the Thurstone Personality Schedule 
(Thurstone & Thurstone, 1930). After first culling 
hundreds of items answerable in the yes-no-? man- 
ner from Woodworth’s inventory and other sources, 
Thurstone rationally keyed items in terms of how 
the neurotic would typically answer them. Reflect- 
ing Thurstone’s penchant for statistical finesse, this 
inventory was one of the first to use the method of 
internal consistency whereby each prospective item 
was correlated with the total score on the tentatively
identified scale to determine whether it belonged
on the scale.
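
The method of internal consistency described above can be illustrated with a short sketch. It is not Thurstone's actual procedure or data: the response matrix, the function names, and the 0.30 retention cutoff are all hypothetical choices made for illustration. Each item is correlated with the total score (which, in this uncorrected form, includes the item itself), and items with weak correlations are flagged for removal.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def item_total_correlations(responses):
    """Correlate each item with the total score across respondents.

    responses: one row per respondent, each row a list of keyed answers
    (1 = answered in the keyed direction, 0 = not).
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals) for j in range(n_items)]


def retained_items(responses, cutoff=0.30):
    """Indices of items whose item-total correlation meets the (assumed) cutoff."""
    return [j for j, r in enumerate(item_total_correlations(responses)) if r >= cutoff]


# Hypothetical keyed responses: 6 respondents, 4 prospective items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

correlations = item_total_correlations(responses)
kept = retained_items(responses)
```

In this made-up sample the fourth item correlates only weakly with the total and falls below the cutoff, so it would be dropped from the tentative scale. Modern practice often uses a corrected item-total correlation, which excludes the item from the total before correlating.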

From the Thurstone test sprang the Bernreuter 
Personality Inventory (Bernreuter, 1931). It was a 
little more refined than its Thurstone predecessor, 
measuring four personality dimensions: neurotic 
tendency, self-sufficiency, introversion-extrover- 
sion, and dominance-submission. A major innova- 
tion in test construction was that a single test item 
could contribute to more than one scale. 

The Allport-Vernon Study of Values was also 
published in 1931 (Allport & Vernon, 1931). This 
test was quite different from the others in that it 
measured values instead of psychopathology. Furthermore,
it adopted a new scoring method, the ipsative
approach, in which the respondent was
compared only with himself or herself regarding 
the balance of importance given to six basic values: 
theoretical, economic, aesthetic, social, political, 
and religious. The test was devised in such a man- 
ner that subjects were required to make choices be- 
tween the six values in specific situations. As a 
consequence, the average on the six scales was al- 
ways the same for each subject. A weakness in one 
value was compensated for by a strength in some 
other value. Thus, only the relative peaks and val- 
leys were of interest. 
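
A brief sketch may help show why forced-choice scoring fixes the average across scales. The item format below (a fixed number of points divided between two values on each item) and all names are assumptions for illustration, not the actual Study of Values scoring key. Because every item hands out the same number of points, every respondent ends with the same total, and only the distribution across the six values differs.

```python
VALUES = ("theoretical", "economic", "aesthetic", "social", "political", "religious")


def ipsative_profile(choices, points_per_item=3):
    """Score forced-choice items ipsatively.

    choices: list of (value_a, points_a, value_b) tuples; the respondent gives
    points_a to value_a and the remaining points to value_b.
    """
    scores = dict.fromkeys(VALUES, 0)
    for value_a, points_a, value_b in choices:
        if not 0 <= points_a <= points_per_item:
            raise ValueError("points_a must lie between 0 and points_per_item")
        scores[value_a] += points_a
        scores[value_b] += points_per_item - points_a
    return scores


# Two hypothetical respondents answering the same four items differently.
respondent_a = [
    ("theoretical", 3, "economic"),
    ("aesthetic", 1, "social"),
    ("political", 2, "religious"),
    ("theoretical", 2, "religious"),
]
respondent_b = [
    ("theoretical", 0, "economic"),
    ("aesthetic", 3, "social"),
    ("political", 1, "religious"),
    ("theoretical", 1, "religious"),
]

profile_a = ipsative_profile(respondent_a)
profile_b = ipsative_profile(respondent_b)

# Both totals equal 3 points x 4 items, so the mean across scales is identical.
mean_a = sum(profile_a.values()) / len(VALUES)
mean_b = sum(profile_b.values()) / len(VALUES)
```

The two profiles have different peaks and valleys but the same mean, which is why ipsative scores describe the balance of values within one person and do not support comparisons of overall level between persons.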


Any chronology of self-report inventories 
must surely include the Minnesota Multiphasic 
Personality Inventory, or MMPI (Hathaway & 
McKinley, 1940). This test and its revision, the 
MMPI-2, are discussed in detail later. It will suf- 
fice for now to point out that the scales of the 
MMPI were constructed by the method that Wood- 
worth pioneered, contrasting the responses of nor- 
mal and psychiatrically disturbed subjects. In 
addition, the MMPI introduced the use of validity 
scales to determine fake bad, fake good, and ran- 
dom response patterns. 






THE ORIGINS OF
PROJECTIVE TESTING


The projective approach originated with the word 
association method pioneered by Francis Galton in 
the late 1800s. Galton gave himself four seconds to 
come up with as many associations as possible to a 
stimulus word, and then categorized his associa- 
tions as parrotlike, image-mediated, or histrionic 
representations. This latter category convinced him 
that mental operations “sunk wholly below the 
level of consciousness” were at play. Some histori- 
ans have even speculated that Freud’s application 
of free association as a therapeutic tool in psycho- 
analysis sprang from Galton’s paper published in 
Brain in 1879 (Forrest, 1974). 

Galton’s work was continued in Germany by 
Wundt and Kraepelin, and finally brought to 
fruition by Jung (1910). Jung’s test consisted of 
100 stimulus words. For each word, the subject 
was to reply as quickly as possible with the first 
word coming to mind. Kent and Rosanoff (1910) 
gave the association method a distinctively Amer- 
ican flavor by tabulating the reactions of 1,000 
normal subjects to a list of 100 stimulus words. 
These tables were designed to provide a basis for 
comparing the reactions of normal and “insane” 
subjects. 

While the Americans were pursuing the empir- 
ical approach to objective personality testing, a 
young Swiss psychiatrist, Hermann Rorschach 
(1884-1922), was developing a completely different
vehicle for studying personality. Rorschach was
strongly influenced by Jungian and psychoanalytic
thinking, so it was natural that his new approach fo- 
cused on the tendency of patients to reveal their in- 
nermost conflicts unconsciously when responding 
to ambiguous stimuli. The Rorschach and other 
projective tests discussed subsequently were pred- 
icated upon the projective hypothesis: When re- 
sponding to ambiguous or unstructured stimuli, we 
inadvertently disclose our innermost needs, fan- 
tasies, and conflicts. 

Rorschach was convinced that people revealed 
important personality dimensions in their responses 
to inkblots. He spent years developing just the right 
set of ten inkblots and systematically analyzed the 
responses of personal friends and different patient 
groups (Rorschach, 1921). Unfortunately, he died 
only a year after his monograph was published, and 
it was up to others to complete his work. Develop- 
ments in the Rorschach are reviewed later in the 
text. 

While Rorschach’s test was originally developed
to reveal the innermost workings of the abnormal
subject, the TAT, or Thematic Apperception
Test (Morgan & Murray, 1935), was developed as 
an instrument to study normal personality. Of 
course, both have since been expanded for testing 
with the entire continuum of human behavior. 

The TAT consists of a series of pictures that 
largely depict one or more persons engaged in an 
ambiguous interaction. The subject is shown one 
picture at a time and told to make up a story about 
it. He or she is instructed to be as dramatic as pos- 
sible, to discuss thoughts and feelings, and to de- 
scribe the past, present, and future of what is 
depicted in the picture. 

Murray (1938) believed that underlying per- 
sonality needs, such as the need for achievement, 
would be revealed by the contents of the stories. 
Although numerous scoring systems were devel- 
oped, clinicians in the main have relied upon an 
impressionistic analysis to make sense of TAT pro- 
tocols. Modern applications of the TAT are dis- 
cussed in a later chapter. 

The sentence completion technique was also 
begun during this era with the work of Payne 
(1928). There have been numerous extensions and
variations on the technique, which consists of giving
subjects a stem such as “I am bored when ______”
and asking them to complete the sentence. Some 
modern applications are discussed later, but it can 
be mentioned now that the problem of scoring and 
interpretation, which vexed early sentence comple- 
tion test developers, is still with us today. 

An entirely new approach to projective testing 
was taken by Goodenough (1926), who tried to 
determine not just intellectual level, but also the 
interests and personality traits of children by ana- 
lyzing their drawings. Buck’s (1948) test, the 
House-Tree-Person, was a little more standardized 
and structured and required the subject to draw a 
house, a tree, and a person. Machover’s (1949) Personality
Projection in the Drawing of the Human
Figure was the logical extension of the earlier 
work. Figure drawing as a projective approach to 
understanding personality is still used today, and a 
later chapter discusses modern developments in 
this practice. 

Meanwhile, projective testing in Europe was 
dominated by the Szondi Test, a wacky instrument 
based on wholly faulty premises. Lipot Szondi was 
a Hungarian-born Swiss psychiatrist who believed 
that major psychiatric disorders were caused by 
recessive genes. His test consisted of 48 photo- 
graphs of psychiatric patients divided into six sets 
of the following eight types: homosexual, epilep- 
tic, sadistic, hysteric, catatonic, paranoiac, manic, 
and depressive (Deri, 1949). From each set of eight 
pictures, the subject was instructed to select the two 
pictures he or she liked best and the two disliked 
most. A person who consistently preferred one kind 
of picture in the six sets was presumed to have 
some recessive genes that made him or her have 
sympathy for the pictured person. Thus, projective 
preferences were presumed to reveal recessive 
genes predisposing the individual to specific psy- 
chiatric disturbances. 

Deri (1949) imported the test to the United 
States and changed the rationale. She did not argue 
for a recessive genetic explanation of picture choice 
but explained such preferences on the basis of unconscious
identification with the characteristics of
the photographed patients. This was a more palatable
theoretical basis for the test than the dubious
genetic theories of Szondi. Nonetheless, empirical 
research cast doubt on the validity of the Szondi 
Test, and it shortly faded into oblivion (Borstel- 
mann & Klopfer, 1953). 


THE DEVELOPMENT
OF INTEREST INVENTORIES


While the clinicians were developing measures for 
analyzing personality and unconscious conflicts, 
other psychologists were devising measures for 
guidance and counseling of the masses of more 
normal persons. Chief among such measures was 
the interest inventory, which has roots going back 
to Thorndike’s (1912) study of developmental 
trends in the interests of 100 college students. In 
1919-1920, Yoakum developed a pool of 1,000 
items relating to interests from childhood through 
early maturity (DuBois, 1970). Many of these items 
were incorporated in the Carnegie Interest Inven- 
tory. Cowdery (1926-27) improved and refined 
previous work on the Carnegie instrument by in- 
creasing the number of items, comparing responses 
of three criterion groups (doctors, engineers, and 
lawyers) with control groups of nonprofessionals, 
and developing a weighting formula for items. He 
was also the first psychometrician to realize the im- 
portance of cross validation. He tested his new 
scales on additional groups of doctors, engineers, 
and lawyers to ensure that the discriminations 
found in the original studies were reliable group 
differences rather than capitalizations on error 
variance. 
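
The logic of cross validation can be sketched in a few lines. This is an illustration only, not Cowdery's weighting formula: the unit weights, the 0.2 gap threshold, the function names, and the tiny response samples are all hypothetical. A key is built from the items that separate a criterion group from a control group, and the key is then applied to fresh groups to see whether the separation holds up.

```python
def endorsement_rates(group):
    """Proportion of a group endorsing each item (rows are 0/1 response lists)."""
    n = len(group)
    return [sum(person[j] for person in group) / n for j in range(len(group[0]))]


def build_key(criterion, control, min_gap=0.2):
    """Weight items +1 or -1 when the groups differ by at least min_gap, else 0."""
    key = []
    for p_crit, p_ctrl in zip(endorsement_rates(criterion), endorsement_rates(control)):
        gap = p_crit - p_ctrl
        key.append(1 if gap >= min_gap else -1 if gap <= -min_gap else 0)
    return key


def scale_score(person, key):
    """Weighted sum of one person's responses under the key."""
    return sum(w * r for w, r in zip(key, person))


def mean_separation(criterion, control, key):
    """Difference between the groups' mean scale scores."""
    def group_mean(group):
        return sum(scale_score(p, key) for p in group) / len(group)
    return group_mean(criterion) - group_mean(control)


# Hypothetical derivation samples (e.g., engineers versus nonprofessionals).
derive_criterion = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
derive_control = [[0, 1, 1, 1], [0, 0, 1, 0], [0, 1, 1, 1]]

# Hypothetical fresh samples, not used to build the key.
holdout_criterion = [[1, 0, 0, 1], [1, 1, 1, 0]]
holdout_control = [[0, 1, 1, 0], [1, 0, 1, 1]]

key = build_key(derive_criterion, derive_control)
derivation_gap = mean_separation(derive_criterion, derive_control, key)
holdout_gap = mean_separation(holdout_criterion, holdout_control, key)
```

In this toy example the key still separates the fresh groups, but by a smaller margin than in the derivation samples. That shrinkage is what cross validation is meant to expose: part of the original discrimination reflects chance features of the samples on which the key was built.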

Edward K. Strong (1884-1963) revised 
Cowdery’s test and devoted 36 years to the devel- 
opment of empirical keys for the modified instru- 
ment known as the Strong Vocational Interest 
Blank (SVIB). Persons taking the test could be 
scored on separate keys for several dozen occupa- 
tions, providing a series of scores of immeasurable 
value in vocational guidance. The SVIB became 
one of the most widely used tests of all time 
(Strong, 1927). Its modern version, the Strong Interest
Inventory, is still widely used by guidance
counselors. 

For decades the only serious competitor to the 
SVIB was the Kuder Preference Record (Kuder, 
1934). The Kuder differed from the Strong by forc- 
ing choices within triads of items. The Kuder was 
an ipsative test; that is, it compared the relative 
strength of interests within the individual, rather 
than comparing his or her responses to various pro- 
fessional groups. More recent revisions of the 
Kuder Preference Record include the Kuder General
Interest Survey and the Kuder Occupational Interest
Survey (Kuder, 1966; Kuder & Diamond,
1979; Zytowski, 1985). 


SUMMARY OF MAJOR LANDMARKS 
IN THE HISTORY OF TESTING 


We conclude our historical survey of psychologi- 
cal testing with a brief tabular summary of land- 
mark events up to 1950 (Table 1.2). The interested 
reader can find a more detailed listing—including 
a chronology of post-1950 developments—in 
Appendix A. 


TABLE 1.2 A Summary of Early Landmarks in the History of Testing

2200 B.C.   Chinese begin civil service examinations.
A.D. 1862   Wilhelm Wundt uses a calibrated pendulum to measure the “speed of
            thought.”
1884        Francis Galton administers the first test battery to thousands of
            citizens at the International Health Exhibit.
1890        James McKeen Cattell uses the term mental test in announcing the
            agenda for his Galtonian test battery.
1901        Clark Wissler discovers that Cattellian “brass instruments” tests
            have no correlation with college grades.
1905        Binet and Simon invent the first modern intelligence test.
1914        Stern introduces the IQ, or intelligence quotient: the mental age
            divided by chronological age.
1916        Lewis Terman revises the Binet-Simon scales, publishes the
            Stanford-Binet. Revisions appear in 1937, 1960, and 1986.
1917        Robert Yerkes spearheads the development of the Army Alpha and
            Beta examinations used for testing WWI recruits.
1917        Robert Woodworth develops the Personal Data Sheet, the first
            personality test.
1920        Rorschach Inkblot test published.
1921        Psychological Corporation, the first major test publisher, founded
            by Cattell, Thorndike, and Woodworth.
1927        First edition of the Strong Vocational Interest Blank published.
1939        Wechsler-Bellevue Intelligence Scale published. Revisions
            published in 1955, 1981, and 1997.
1942        Minnesota Multiphasic Personality Inventory published.
1949        Wechsler Intelligence Scale for Children published. Revisions
            published in 1974, 1991.







SUMMARY 


1. In 1910, Henry Goddard translated the 
1908 Binet-Simon scale. In 1911, he tested more 
than a thousand schoolchildren with the test, relying
upon the original French norms. He was disturbed
to find that 3 percent of the sample was
“feebleminded” and recommended segregation 
from society for these children. 


2. Nonverbal intelligence tests were invented 
in the early 1900s to facilitate testing of non- 
English-speaking immigrants. For example, Knox 
published a wooden puzzle test in 1914 and also 
used the now familiar digit-symbol substitution test. 


3. In 1916, Lewis Terman released the Stan- 
ford-Binet, a revision of the Binet scales. This 
well-designed and carefully normed test placed 
intelligence testing on a firm footing once and for 
all. 


4. During WWI, Robert Yerkes headed a team 
of psychologists who produced the Army Alpha, a 
verbally loaded group test for average and superior 
recruits, and the Army Beta, a nonverbal group test 
for illiterates and non-English-speaking recruits. 


5. Early testing pioneers such as C. C. 
Brigham used results of individual and group in- 
telligence tests to substantiate ethnic differences in 
intelligence and thereby justify immigration re- 
strictions. Later, some of these testing pioneers dis- 
avowed their prior views. 


6. Educational testing fell under the purview 
of the College Entrance Examination Board 
(CEEB), founded at the turn of the twentieth century.
In 1947, the CEEB was replaced by the Educational
Testing Service (ETS), which supervised
the release of such well-known tests as the Scholas- 
tic Aptitude Tests and the Graduate Record Exam. 


7. The advent of multiple aptitude test batter- 
ies was made possible with the development of fac- 
tor analysis by L. L. Thurstone and others. Later, 
the improvement of these test batteries was spurred 
on by the practical need for selecting WWII recruits 
for highly specialized positions. 

8. Personality testing began with Wood- 
worth’s Personal Data Sheet, a simple yes-no 
checklist of symptoms used to screen WWI recruits 
for psychoneurosis. Many later inventories, includ- 
ing the popular Minnesota Multiphasic Personality 
Inventory, borrowed content from the Personal 
Data Sheet. 


9. Projective testing began with the word as- 
sociation technique pioneered by Francis Galton 
and brought to fruition by C. G. Jung in 1910. Her- 
mann Rorschach published his famous inkblot test 
in 1921. 


10. The Thematic Apperception Test (TAT), a 
picture storytelling test introduced in 1935 by Mor- 
gan and Murray, was based upon the projective 
hypothesis: When responding to ambiguous or un- 
structured stimuli, examinees inadvertently dis- 
close their innermost needs, fantasies, and conflicts. 


11. The assessment of vocational interest 
began with Yoakum’s Carnegie Interest Inventory 
developed in 1919-1920. After several revisions 
and extensions, this instrument emerged as E. K. 
Strong’s Vocational Interest Blank.


The historical introduction in the preceding
chapter has acquainted the reader with only a
small fraction of the many types and uses of
psychological tests. These early tests were used 
predominantly for two purposes: to measure intel- 
ligence and to detect personality disorders. It is un- 
derstandable, then, that the average citizen equates 
psychological testing with IQ scores, inkblots, and 
personality inventories. Certainly, there is more
than a grain of truth to this common view: Mea- 
sures of personality and intelligence are still the 
essential mainstays of psychological testing. How- 
ever, psychometricians have developed many other 
kinds of tests for diverse and imaginative purposes 
that the early testing pioneers might never have anticipated.
This chapter provides a panoramic survey
of psychological tests and their numerous applications.
In Topic 2A, The Nature and Uses of
Psychological Tests, we summarize the different 
types and varied applications of modern tests. In 
Topic 2B, The Testing Process, we emphasize that 
testing is a transaction between tester and exami- 
nee, not a sterile process of measurement. 

From birth to old age, we encounter tests at al- 
most every turning point in life. The baby’s first 
test, conducted immediately after birth, is the Apgar
test, a quick, multivariate assessment of heart rate, 
respiration, muscle tone, reflex irritability, and 
color (Clarke-Stewart & Friedman, 1987). The total 
Apgar score (0 to 10) helps determine the need
for any immediate medical attention. Later, a
toddler who previously received a low Apgar score 
might be a candidate for developmental disability 
assessment. The preschool child may take school- 
readiness tests. Once a school career is begun, each 
student endures hundreds, perhaps thousands, of
academic tests before graduation, not to mention
possible tests for learning disability, giftedness,
vocational interest, and college admission. After
graduation, adults may face tests for job entry, driver’s
license, security clearance, personality function, 
marital compatibility, developmental disability, 
brain dysfunction—the list is nearly endless. Some 
persons even encounter one final indignity in the 
frailness of their later years: a test to determine 
their competency to manage financial affairs. 

The idea of a test is thus a pervasive element of 
our culture, a feature we take for granted. However, 
the layperson’s notion of a test does not necessar- 
ily coincide with the more restrictive view held by 
psychometricians. A psychometrician is a spe- 
cialist in psychology or education who develops 
and evaluates psychological tests. Because of wide- 
spread misunderstandings about the nature of tests, 
it is fitting that we begin this topic with a funda- 
mental question, one that defines the scope of the 
entire book: What is a test?


DEFINITION OF A TEST


A test is a standardized procedure for sampling be- 
havior and describing it with categories or scores. 
In addition, most tests have norms or standards by 
which the results can be used to predict other, more 
important behaviors. We elaborate these character- 
istics in the sections that follow, but first it is in- 
structive to portray the scope of the definition. 
Included in this view are traditional tests such as 
personality questionnaires and intelligence tests, 
but the definition also subsumes diverse procedures 
that the reader might not recognize as tests. For ex- 
ample, all of the following could be tests according 
to the definition used in this book: a checklist for 
rating the social skills of a youth with mental retardation;
a nontimed measure of mastery in adding
pairs of three-digit numbers; microcomputer ap- 
praisals of reaction time; and even situational tests 
such as observing an individual working on a group 
task with two “helpers” who are obstructive and un- 
cooperative. 

In sum, tests are enormously varied in their formats
and applications. Nonetheless, most tests
possess these defining features:


• Standardized procedure
• Behavior sample
• Scores or categories
• Norms or standards
• Prediction of nontest behavior


In the sections that follow, we examine each of these 
characteristics in more detail. The portrait that we 
draw pertains especially to norm-referenced tests— 
tests that use a well-defined population of persons 
for their interpretive framework. However, the 
defining characteristics of a test differ slightly for 
the special case of criterion-referenced tests—tests 
that measure what a person can do rather than com- 
paring results to the performance levels of others. 
For this reason, we provide a separate discussion of 
criterion-referenced tests. 

Standardized procedure is an essential fea- 
ture of any psychological test. A test is considered 
to be standardized if the procedures for adminis- 
tering it are uniform from one examiner and setting 
to another. Of course, standardization depends to 
some extent upon the competence of the examiner. 
Even the best test can be rendered useless by a care- 
less, poorly trained, or ill-informed tester, as the 
reader will discover in Topic 2B, The Testing Pro- 
cess. However, most examiners are competent. 
Standardization therefore rests largely upon the di- 
rections for administration found in the instruc- 
tional manual that typically accompanies a test. 

The formulation of directions is an essential 
step in the standardization of a test. In order to 
guarantee uniform administration procedures, the 
test developer must provide comparable stimulus 
materials to all testers, specify with considerable 
precision the oral instructions for each item or subtest,
and advise the examiner how to handle a wide
range of queries from the examinee. 

To illustrate these points, consider the number 
of different ways a test developer might approach 
the assessment of digit span—the maximum num- 
ber of orally presented digits a subject can recall 
from memory. An unstandardized test of digit span 
might merely suggest that the examiner orally pre- 
sent increasingly long series of numbers until the 
subject fails. The number of digits in the longest se- 
ries recalled would then be the subject’s digit span. 
Most readers can discern that such a loosely de- 
fined test will lack uniformity from one examiner 
to another. If the tester is free to improvise any se- 
ries of digits, what is to prevent him or her from 
presenting, with the familiar inflection of a televi- 
sion announcer, “1-800-325-3535”? Such a series 
would be far easier to recall than a more random
set, such as “7-2-8-1-9-4-6-3-7-4-2.” The speed of
presentation would also crucially affect the unifor- 
mity of a digit span test. For purposes of standard- 
ization, it is essential that every examiner present 
each series at a constant rate, for example, one digit 
per second. Finally, the examiner needs to know 
how to react to unexpected responses such as a sub- 
ject asking, “Could you repeat that again?” For ob- 
vious reasons, the usual advice is “No.” 
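
The requirements just described can be made concrete in a short sketch. The Python below is only an illustration under stated assumptions: the function names, the particular digit series, and the rule of stopping at the first failed series are hypothetical and are not taken from any published test. It shows how fixing the stimuli, the presentation rate, and the scoring rule in advance leaves the examiner nothing to improvise.

```python
# Hypothetical fixed stimulus list: every examiner presents the same series,
# in the same order, so no one can improvise an easy or familiar sequence.
DIGIT_SERIES = [
    "5-8-2",
    "6-9-4-1",
    "7-2-8-6-3",
    "3-9-1-7-4-2",
    "4-1-7-9-3-8-6",
]

# Constant presentation rate of one digit per second, as suggested in the text.
SECONDS_PER_DIGIT = 1.0


def presentation_time(series):
    """Seconds needed to read one series aloud at the constant rate."""
    return len(series.split("-")) * SECONDS_PER_DIGIT


def digit_span(responses):
    """Return the length of the longest series recalled correctly.

    responses[i] is the examinee's recall of DIGIT_SERIES[i]. Testing stops
    at the first failure (an illustrative discontinue rule, assumed here).
    """
    span = 0
    for series, response in zip(DIGIT_SERIES, responses):
        if response != series:
            break
        span = len(series.split("-"))
    return span
```

For example, an examinee who repeats the first two series correctly and transposes two digits in the third would receive a digit span of 4 under this scoring rule.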

The test developer may even go so far as to rec- 
ommend a desired demeanor in the examiner such 
as maintaining a neutral facial expression when 
recording a subject’s response. These seemingly 
subtle influences can have a serious impact upon 
the uniformity of testing procedures. For example, 
an examiner who smirks when recording answers 
might cause the subject to become anxious and fail 
an easy task. We discuss the potential influence of 
the examiner upon test results in the next topic, The 
Testing Process. 

A psychological test is also a limited sample of 
behavior. Neither the subject nor the examiner has 
sufficient time for truly comprehensive testing, 
even when the test is targeted to a well-defined and 
finite behavior domain. Thus, practical constraints 
dictate that a test is only a sample of behavior. Yet, 
the sample of behavior is of interest only insofar as
it permits the examiner to make inferences about
the total domain of relevant behaviors. For exam- 
ple, the purpose of a vocabulary test is to determine 
the examinee’s entire word stock by requesting de- 
finitions of a very small but carefully selected sam- 
ple of words. Whether the subject can define the 
particular 35 words from a vocabulary subtest (e.g., 
on the Wechsler Adult Intelligence Scale, or the 
WAIS-R) is of little direct consequence. But the in- 
direct meaning of such results is of great import be- 
cause it signals the examinee’s general knowledge 
of vocabulary. 

An interesting point—and one little understood 
by the lay public—is that the test items need not re- 
semble the behaviors that the test is attempting to 
predict. The essential characteristic of a good test 
is that it permits the examiner to predict other be- 
haviors—not that it mirrors the to-be-predicted be- 
haviors. If answering “true” to the question “I drink 
a lot of water” happens to help predict depression, 
then this seemingly unrelated question is a useful 
index of depression. Thus, the reader will note that 
successful prediction is an empirical question an- 
swered by appropriate research. While most tests 
do sample directly from the domain of behaviors 
they hope to predict, this is not a psychometric 
requirement. 

A psychological test must also permit the de- 
rivation of scores or categories. Thorndike (1918) 
expressed the essential axiom of testing in his fa- 
mous assertion, “Whatever exists at all exists in 
some amount.” McCall (1939) went a step further, 
declaring, “Anything that exists in amount can 
be measured.” Testing strives to be a form of 
measurement akin to procedures in the physical sci- 
ences whereby numbers represent abstract dimen- 
sions such as weight or temperature. Every test 
furnishes one or more scores or provides evidence 
that a person belongs to one category and not an- 
other. In short, psychological testing sums up per- 
formance in numbers or classifications. 

The implicit assumption of the psychometric 
viewpoint is that tests measure individual differ- 
ences in traits or characteristics that exist in some 
vague sense of the word. In most cases, all people
are assumed to possess the trait or characteristic
being measured, albeit in different amounts. The 
purpose of the testing is to estimate the amount of 
the trait or quality possessed by an individual. 

In this context, two cautions are worth men- 
tioning. First, every test score will always reflect 
some degree of measurement error. The impreci- 
sion of testing is simply unavoidable: Tests must 
rely upon an external sample of behavior to esti- 
mate an unobservable and therefore inferred char- 
acteristic. Psychometricians often express this 
fundamental point with an equation: 


X = T + e


where X is the observed score, T is the true score, 
and e is a positive or negative error component. The 
best that a test developer can do is make e very 
small. It can never be completely eliminated, nor 
can its exact impact be known in the individual 
case. We discuss the concept of measurement error 
in Topic 3B, Concepts of Reliability. 
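
A small simulation may help make the equation tangible. The sketch below is an illustration only: it assumes that the error component is drawn from a normal distribution with a mean of zero, and it imagines that the same person could be tested many times without practice or fatigue effects, which is not possible in real testing. The function name and the numerical values are hypothetical.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Draw observed scores X = T + e, with e assumed Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_administrations)]


# Hypothetical examinee with a true score of 100 and an error SD of 5 points.
scores = simulate_observed_scores(true_score=100.0, error_sd=5.0, n_administrations=10_000)
errors = [x - 100.0 for x in scores]

# Any single observed score may miss T by several points ...
print(round(scores[0], 1))
# ... but the errors average out close to zero over many administrations.
print(round(statistics.mean(errors), 2))
```

In practice the examiner sees only one value of X, which is why the exact size of e in the individual case remains unknown.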

The second caution is that test consumers must 
be wary of reifying the characteristic being mea- 
sured. Test results do not represent a thing with 
physical reality. Typically, they portray an abstrac- 
tion that has been shown to be useful in predicting 
nontest behaviors. For example, in discussing a per- 
son’s IQ, psychologists are referring to an abstrac- 
tion that has no direct, material existence but that 
is, nonetheless, useful in predicting school achieve- 
ment and other outcomes. 

A psychological test must also possess norms 
or standards. An examinee’s test score is usually in- 
terpreted by comparing it with the scores obtained 
by others on the same test. For this purpose, test de- 
velopers typically provide norms—a summary of 
test results for a large and representative group of 
subjects (Petersen, Kolen, & Hoover, 1989). The 
norm group is referred to as the standardization 
sample. 

The selection and testing of the standard- 
ization sample is crucial to the usefulness of a 
test. This group must be representative of the pop- 
ulation for whom the test is intended or else it is not 
possible to determine an examinee’s relative standing.
In the extreme case when norms are not provided,
the examiner can make no use of the test results
at all. An exception to this point occurs in the
case of criterion-referenced tests, discussed later. 

Norms not only establish an average perfor- 
mance, but also serve to indicate the frequency with 
which different high and low scores are obtained. 
Thus, norms allow the tester to determine the de- 
gree to which a score deviates from expectations. 
Such information can be very important in predict- 
ing the nontest behavior of the examinee. Norms 
are of such overriding importance in test interpre- 
tation that we consider them at length in a separate 
section later in this text. 
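
Two common ways of expressing how far a score deviates from expectations can be sketched in a few lines. The example below is a minimal illustration, not a description of how any particular test publisher computes norms: the function names and the ten-score norm group are hypothetical, and real standardization samples are far larger.

```python
import statistics


def z_score(raw_score, norm_scores):
    """Deviation of a raw score from the norm-group mean, in SD units."""
    mean = statistics.mean(norm_scores)
    sd = statistics.pstdev(norm_scores)
    return (raw_score - mean) / sd


def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below the raw score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)


# Hypothetical standardization sample of ten raw scores (mean = 19).
norm_group = [12, 15, 15, 17, 18, 20, 21, 23, 24, 25]

print(z_score(19, norm_group))          # a score at the mean deviates by 0 SD
print(percentile_rank(21, norm_group))  # six of the ten scores fall below 21
```

Both indices depend entirely on the norm group supplied, which is why an unrepresentative standardization sample undermines the interpretation of every score.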

Finally, tests are not ends in themselves. In gen- 
eral, the ultimate purpose of a test is to predict 
additional behaviors, other than those directly sam- 
pled by the test. Thus, the tester may have more in- 
terest in the nontest behaviors predicted by the test 
than in the test responses per se. Perhaps a concrete 
example will clarify this point. Suppose an exam- 
iner administers an inkblot test to a patient in a psy- 
chiatric hospital. Assume that the patient responds 
to one inkblot by describing it as “eyes peering 
out.” Based on established norms, the examiner 
might then predict that the subject will be highly 
suspicious and a poor risk for individual psycho- 
therapy. The purpose of the testing is to arrive at 
this and similar predictions—not to determine 
whether the subject perceives eyes staring out from 
the blots. 

The ability of a test to predict nontest behavior 
is determined by an extensive body of validational 
research, most of which is conducted after the test 
is released. But there are no guarantees in the world 
of psychometric research. It is not unusual for a test 
developer to publish a promising test, only to read 
years later that other researchers find it deficient. 
There is a lesson here for test consumers: The fact 
that a test exists and purports to measure a certain 
characteristic is no guarantee of truth in advertis- 
ing. A test may have a fancy title, precise instruc- 
tions, elaborate norms, attractive packaging, and 
preliminary findings—but if in the dispassionate 
study of independent researchers the test fails to 
predict appropriate nontest behaviors, then it is 
useless.


FURTHER DISTINCTIONS IN TESTING


The chief features of a test previously outlined 
apply especially to norm-referenced tests, which 
constitute the vast majority of tests in use. In a 
norm-referenced test, the performance of each 
examinee is interpreted in reference to a relevant 
standardization sample (Petersen, Kolen, & Hoover, 
1989). However, these features are less relevant in 
the special case of criterion-referenced tests, since 
these instruments suspend the need for comparing 
the individual examinee with a reference group. In 
a criterion-referenced test, the objective is to de- 
termine where the examinee stands with respect to 
very tightly defined educational objectives (Berk, 
1984). For example, one part of an arithmetic test 
for 10-year-olds might measure the accuracy level 
in adding pairs of two-digit numbers. In an untimed 
test of 20 such problems, accuracy should be nearly 
perfect. For this kind of test, it really does not matter 
how the individual examinee compares to others of 
the same age. What matters is whether the exami- 
nee meets an appropriate, specified criterion—for 
example, 95 percent accuracy. Because there is 
no comparison to the normative performance of 
others, this kind of measurement tool is aptly des- 
ignated a criterion-referenced test. The important 
distinction here is that, unlike norm-referenced 
tests, criterion-referenced tests can be meaning- 
fully interpreted without reference to norms. We 
discuss criterion-referenced tests in more detail in 
Topic 3A, Norms and Test Standardization. 
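
The arithmetic example can be expressed as a simple decision rule. The sketch below is illustrative only; the function name is hypothetical, and the 95 percent criterion and the 20-item test are taken from the example above. Note that no norm group appears anywhere in the computation.

```python
def meets_criterion(num_correct, num_items, criterion_percent=95):
    """True if the percentage correct reaches the stated mastery criterion.

    Integer arithmetic is used so that a score exactly at the criterion
    (for example, 19 of 20 at 95 percent) is never lost to rounding.
    """
    return num_correct * 100 >= criterion_percent * num_items


print(meets_criterion(19, 20))  # 19 of 20 is exactly 95 percent
print(meets_criterion(18, 20))  # 18 of 20 is 90 percent
```

The result depends only on the examinee’s own performance and the preset standard, which is the defining property of a criterion-referenced interpretation.
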

Another important distinction is between testing
and assessment, which are often considered
equivalent. However, they do not mean exactly the 
same thing. Assessment is a more comprehensive 
term, referring to the entire process of compiling 
information about a person and using it to make in- 
ferences about characteristics and to predict be- 
havior. Assessment can be defined as appraising or 
estimating the magnitude of one or more attributes 
in a person. The assessment of human characteris- 
tics involves observations, interviews, checklists, 
inventories, projectives, and other psychological 
tests. In sum, tests represent only one source of
information used in the assessment process. In
assessment, the examiner must compare and combine
data from different sources. This is an inherently 
subjective process that requires the examiner to sort 
out conflicting information and make predictions 
based on a complex gestalt of data. 

The term assessment was invented during 
World War II to describe a program to select men 
for secret service assignment in the Office of 
Strategic Services (OSS Assessment Staff, 1948). 
The OSS staff of psychologists and psychiatrists 
amassed a colossal amount of information on can- 
didates during four grueling days of written tests, 
interviews, and personality tests. In addition, the 
assessment process included a variety of real-life 
situational tests based on the realization that there 
was a difference between know-how and can-do: 


We made the candidates actually attempt the tasks 
with their muscles or spoken words, rather than 
merely indicate on paper how the tasks could be 
done. We were prompted to introduce realistic tests 
of ability by such findings as this: that men who 
earn a high score in Mechanical Comprehension, a 
paper-and-pencil test, may be below average when 
it comes to solving mechanical problems with their 
hands. (OSS Assessment Staff, 1948) 


The situational tests included group tasks of 
transporting equipment across a raging brook and 
scaling a 10-foot-high wall, as well as individual 
scrutiny of the ability to survive a realistic interro- 
gation and to command two uncooperative subor- 
dinates in a construction task. 

On the basis of the behavioral observations and 
test results, the OSS staff rated the candidates on 
dozens of specific traits in such broad categories as 
leadership, social relations, emotional stability, 
effective intelligence, and physical ability. These 
ratings served as the basis for selecting OSS 
personnel. 


TYPES OF TESTS


Tests can be broadly grouped into two camps: group 
tests versus individual tests. Group tests are largely 
pencil-and-paper measures suitable to the testing of 
large groups of persons at the same time. Individual 
tests are instruments that by their design and
purpose must be administered one on one. An
important advantage of individual tests is that the 
examiner can gauge the level of motivation of 
the subject and assess the relevance of other fac- 
tors (e.g., impulsiveness or anxiety) upon the test 
results.

For convenience, we will sort tests into the 
eight categories depicted in Table 2.1. Each of the 
categories contains norm-referenced, criterion- 
referenced, individual, and group tests. The reader 
will note that any typology of tests is a purely arbi- 
trary determination. For example, we could argue 
for yet another dichotomy: tests that seek to mea- 
sure maximum performance (e.g., an intelligence 
test) versus tests that seek to gauge a typical re- 
sponse (e.g., a personality inventory). 

In a narrow sense, there are hundreds—perhaps 
thousands—of different kinds of tests, each mea- 
suring a slightly different aspect of the individual. 
For example, even two tests of intelligence might 
be arguably different types of measures. One test 
might reveal the assumption that intelligence is a 
biological construct best measured through brain 
waves, whereas another might be rooted in the 
traditional view that intelligence is exhibited in the 
capacity to learn acculturated skills such as vocabulary.
Lumping both measures under the category
of intelligence tests is certainly an oversimplification,
but nonetheless a useful starting point.

As the reader will recall from the first chapter,
intelligence tests were originally designed to sample
a broad assortment of skills in order to estimate
the individual’s general intellectual level. The
Binet-Simon scales were successful, in part, because
they incorporated heterogeneous tasks, including
word definitions, memory for designs,
comprehension questions, and spatial visualization
tasks. The group intelligence tests that blossomed
with such profusion during and after WWI also
tested diverse abilities—witness the Army Alpha
with its eight different sections measuring practical
judgment, information, arithmetic, and reasoning,
among other skills.

Modern intelligence tests also emulate this historically
established pattern by sampling a wide
variety of proficiencies deemed important in our
culture. In general, the term intelligence test refers
to a test that yields an overall summary score based
on results from a heterogeneous sample of items.
Of course, such a test might also provide a profile
of subtest scores as well, but it is the overall score
that generally attracts the most attention.


TABLE 2.1 The Main Types of Psychological Tests





Intelligence Tests: Measure an individual’s ability in relatively global areas such as
verbal comprehension, perceptual organization, or reasoning and thereby help determine
potential for scholastic work or certain occupations.

Aptitude Tests: Measure the capability for a relatively specific task or type of skill;
aptitude tests are, in effect, a narrow form of ability testing.

Achievement Tests: Measure a person’s degree of learning, success, or accomplishment
in a subject or task.

Creativity Tests: Assess novel, original thinking and the capacity to find unusual or
unexpected solutions, especially for vaguely defined problems.

Personality Tests: Measure the traits, qualities, or behaviors that determine a person’s
individuality; such tests include checklists, inventories, and projective techniques.

Interest Inventories: Measure an individual’s preference for certain activities or topics
and thereby help determine occupational choice.

Behavioral Procedures: Objectively describe and count the frequency of a behavior,
identifying the antecedents and consequences of the behavior.

Neuropsychological Tests: Measure cognitive, sensory, perceptual, and motor performance
to determine the extent, locus, and behavioral consequences of brain damage.


Aptitude tests measure one or more clearly defined
and relatively homogeneous segments of ability.
Such tests come in two varieties: single aptitude
tests and multiple aptitude test batteries. A single
aptitude test appraises, obviously, only one ability,
whereas a multiple aptitude test battery provides a
profile of scores for a number of aptitudes.

Aptitude tests are often used to predict success 
in an occupation, training course, or educational 
endeavor. For example, the Seashore Measures of 
Musical Talents (Seashore, 1938), a series of tests 
covering pitch, loudness, rhythm, time, timbre, and 
tonal memory, can be used to identify children with 
potential talent in music. Specialized aptitude tests 
also exist for the assessment of clerical skills, me- 
chanical abilities, manual dexterity, and artistic 
ability. These tests are reviewed in Topic 8A, Apti- 
tude Tests and Factor Analysis. 

The most common use of aptitude tests is to de- 
termine college admissions. Most every college 
student is familiar with the SAT (Scholastic As- 
sessment Test, previously called the Scholastic Ap- 
titude Test) of the College Entrance Examination 
Board. This test contains a Verbal section stressing 
word knowledge and reading comprehension and a 
Mathematics section stressing algebra, geometry, 
and insightful reasoning. In effect, colleges that re- 
quire certain minimum scores on the SAT for ad- 
mission are using the test to predict academic 
success. 

Achievement tests measure a person’s degree 
of learning, success, or accomplishment in a sub- 
ject matter. The implicit assumption of most 
achievement tests is that the schools have taught the 
subject matter directly. The purpose of the test is 
then to determine how much of the material the 
subject has absorbed or mastered. Achievement 
tests commonly have several subtests, such as read- 
ing, mathematics, language, science, and social 
studies. These tests are reviewed in Topic 8B, 
Group Tests of Achievement. 

The distinction between aptitude and achieve- 
ment tests is more a matter of use than content 
(Gregory, 1994a). In fact, any test can be an apti- 
tude test to the extent that it helps predict future 
performance. Likewise, any test can be an achievement
test insofar as it reflects how much the subject
has learned. In practice, then, the distinction
between these two kinds of instruments is deter- 
mined by their respective uses. On occasion, one 
instrument may serve both purposes, acting as an 
aptitude test to forecast future performance and an 
achievement test to monitor past learning. 

Creativity tests assess a subject’s ability to 
produce new ideas, insights, or artistic creations 
that are accepted as being of social, aesthetic, or 
scientific value. Thus, measures of creativity em- 
phasize novelty and originality in the solution of 
fuzzy problems or the production of artistic works. 
A creative response to one problem is illustrated in 
Figure 2.1. 

FIGURE 2.1 Solutions to the Nine-Dot Problem as Examples of Creativity

[Figure: three panels labeled a, b, and c]

Note: Without lifting the pencil, draw through all the dots with as
few straight lines as possible. The usual solution is shown in a.
Creative solutions are depicted in b and c.

Tests of creativity have a checkered history. In
the 1960s, they were touted as a useful alternative
to intelligence tests and used widely in
U.S. school systems. Educators were especially
impressed that creativity tests required divergent
thinking—putting forth a variety of answers to a 
complex or fuzzy problem—as opposed to conver- 
gent thinking—finding the single correct solution 
to a well-defined problem. For example, a creativ- 
ity test might ask the examinee to imagine all the 
things that would happen if clouds had strings trail- 
ing from them down to the ground. Students who 
could come up with a large number of conse- 
quences were assumed to be more creative than 
their less-imaginative colleagues. However, some 
psychometricians are skeptical, concluding that 
creativity is just another label for applied intelli- 
gence (e.g., McNemar, 1964). 

Personality tests measure the traits, qualities, 
or behaviors that determine a person’s individuality;
this information helps predict future behavior.
These tests come in several different varieties, in- 
cluding checklists, inventories, and projective tech- 
niques such as sentence completions and inkblots 
(Table 2.2). 

Interest inventories measure an individual’s 
preference for certain activities or topics and 
thereby help determine occupational choice. These 
tests are based upon the explicit assumption that in- 
terest patterns determine and therefore also predict 
job satisfaction. For example, if the examinee has 
the same interests as successful and satisfied ac- 
countants, it is thought likely that he or she would 
enjoy the work of an accountant. The assumption 
that interest patterns predict job satisfaction is 
largely borne out by empirical studies, as we will review
in Topic 12A, Interests and Values in Vocational
Assessment.


TABLE 2.2 Examples of Personality Test Items

(a) An Adjective Checklist

Check those words which describe you:

( ) relaxed        ( ) assertive
( ) thoughtful     ( ) curious
( ) cheerful       ( ) even-tempered
( ) impatient      ( ) skeptical
( ) morose         ( ) impulsive
( ) optimistic     ( ) anxious

(b) A True-False Inventory

Circle true or false as each statement applies to you:

T  F  I like sports magazines.
T  F  Most people would lie to get a job.
T  F  I like big parties where there is lots of noisy fun.
T  F  Strange thoughts possess me for hours at a time.
T  F  I often regret the missed opportunities in my life.
T  F  Sometimes I feel anxious for no reason at all.
T  F  I like everyone I have met.
T  F  Falling asleep is seldom a problem for me.

(c) A Sentence Completion Projective Test

Complete each sentence with the first thought that comes to you:

I feel bored when ______
What I need most is ______
I like people who ______
My mother was ______

Many kinds of behavioral procedures are 
available for assessing the antecedents and conse- 
quences of behavior, including checklists, rating 
scales, interviews, and structured observations. 
These methods share a common assumption that 
behavior is best understood in terms of clearly de- 
fined characteristics such as frequency, duration, 
antecedents, and consequences. Behavioral proce- 
dures tend to be highly pragmatic in that they are 
usually interwoven with treatment approaches. 

Neuropsychological tests are used in the 
assessment of persons with known or suspected 
brain dysfunction. Neuropsychology is the study 
of brain-behavior relationships. Over the years, 
neuropsychologists have discovered that certain 
tests and procedures are highly sensitive to the ef- 
fects of brain damage. Neuropsychologists use 
these specialized tests and procedures to make in- 
ferences about the locus, extent, and consequences 
of brain damage. 

Although neuropsychological tests and proce- 
dures are helpful in arriving at a neurological diag- 
nosis, their primary purpose is to evaluate the 
sensory, motor, cognitive, and behavioral strengths 
and weaknesses of the neurologically impaired
patient.¹ The evaluation of strengths and weaknesses
in these patients is crucial for documenting im- 
provement, charting the extent of decline in degen- 
erative diseases, and planning effective remediation 
for specific disabilities. A full neuropsychological 
assessment typically requires three to eight hours of 
one-on-one testing with an extensive battery of 
measures. Examiners must undergo comprehensive 
advanced training in order to make sense out of the
resulting mass of test data. We review individual
tests and the major test batteries in Topic 9B,
Neuropsychological and Geriatric Assessment.

1. Advanced radiological techniques such as Computerized
Tomography (CT) scan, Magnetic Resonance Imaging (MRI), and
Positron Emission Tomography (PET) scan now allow neurologists
to make exceedingly accurate inferences about the presence,
location, and causes of brain damage. However, this does
not diminish the importance of neuropsychological testing in
determining the functional consequences of brain damage in the
life of the individual patient.


USES OF TESTING


By far the most common use of psychological tests 
is to make decisions about persons. For example, 
educational institutions frequently use tests to de- 
termine placement levels for students, and univer- 
sities ascertain who should be admitted, in part, on 
the basis of test scores. State, federal, and local civil 
service systems also rely heavily upon tests for 
purposes of personnel selection. 

Even the individual practitioner exploits tests, 
in the main, for decision making. Examples include 
the consulting psychologist who uses a personality 
test to determine that a police department hire one 
candidate and not another, and the neuropsycholo- 
gist who employs tests to conclude that a client has 
suffered brain damage. 

But simple decision making is not the only 
function of psychological testing. It is convenient 
to distinguish five uses of tests: 


• Classification
• Diagnosis and treatment planning
• Self-knowledge
• Program evaluation
• Research


These applications frequently overlap and, on oc- 
casion, are difficult to distinguish one from another. 
For example, a test that helps determine a psychi- 
atric diagnosis might also provide a form of self- 
knowledge. Let us examine these applications in 
more detail. 

The term classification encompasses a variety 
of procedures that share a common purpose: as- 
signing a person to one category rather than another. 
Of course, the assignment to categories is not an end 
in itself but the basis for differential treatment of 
some kind. Thus, classification can have important 
effects such as granting or restricting access to a 
specific college or determining whether a person is 
hired for a particular job. There are many variant 
forms of classification, each emphasizing a partic- 
ular purpose in assigning persons to categories. We 
will distinguish placement, screening, certification,
and selection. 

Placement is the sorting of persons into differ- 
ent programs appropriate to their needs or skills. For 
example, universities often use a mathematics place- 
ment exam to determine whether students should 
enroll in calculus, algebra, or remedial courses. 

Screening refers to quick and simple tests or 
procedures to identify persons who might have spe- 
cial characteristics or needs. Ordinarily, psycho- 
metricians acknowledge that screening tests will 
result in many misclassifications. Examiners are 
therefore advised to do follow-up testing with ad- 
ditional instruments before making important deci- 
sions on the basis of screening tests. For example, 
to identify children with highly exceptional talent 
in spatial thinking, a psychologist might adminis- 
ter a 10-minute paper-and-pencil test to every child 
in a school system. Students who scored in the top 
10 percent might then be singled out for more com- 
prehensive testing. 
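
The screening example above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `flag_for_followup`, the `top_fraction` parameter, and the rule that ties at the cutoff are all flagged are hypothetical choices, not part of any published screening procedure.

```python
def flag_for_followup(scores, top_fraction=0.10):
    """Return IDs of examinees whose screening score is in the top fraction.

    `scores` maps an examinee ID to a screening-test score.
    The names and the tie rule are illustrative assumptions.
    """
    if not scores:
        return []
    # Rank examinees from highest to lowest screening score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Number of examinees to single out (always at least one).
    n_flagged = max(1, round(len(ranked) * top_fraction))
    cutoff = ranked[n_flagged - 1][1]
    # Include ties at the cutoff so that equal scores are treated alike.
    return [examinee for examinee, score in ranked if score >= cutoff]
```

The flagged examinees are candidates for more comprehensive testing only; as the text notes, a screening result by itself is expected to misclassify many persons.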

Certification and selection both have a pass/ 
fail quality. Passing a certification exam confers 
privileges. Examples include the right to practice 
psychology or to drive a car. Thus, certification typ- 
ically implies that a person has at least a minimum 
proficiency in some discipline or activity. Selection 
is similar to certification in that it confers privileges 
such as the opportunity to attend a university or to 
gain employment. 

Another use of psychological tests is for diag- 
nosis and treatment planning. Diagnosis consists of 
two intertwined tasks: determining the nature and 
source of a person’s abnormal behavior, and classifying
the behavior pattern within an accepted diagnostic
system. Diagnosis is usually a precursor to
remediation or treatment of personal distress or im- 
paired performance. 

Psychological tests often play an important role 
in diagnosis and treatment planning. For example, 
intelligence tests are absolutely essential in the di- 
agnosis of mental retardation. Personality tests are 
helpful in diagnosing the nature and extent of emo- 
tional disturbance. In fact, some tests such as the 
MMPI were devised for the explicit purpose of in- 
creasing the efficiency of psychiatric diagnosis. 


Diagnosis should be more than mere classifica- 
tion, more than the assignment of a label. A proper 
diagnosis conveys information—about strengths, 
weaknesses, etiology, and best choices for remedi- 
ation/treatment. Knowing that a child has received 
a diagnosis of learning disability is largely use- 
less. But knowing in addition that the same child is 
well below average in reading comprehension, is 
highly distractible, and needs help with basic phon- 
ics can provide an indispensable basis for treatment 
planning. 

Psychological tests also can supply a potent 
source of self-knowledge. In some cases, the feed- 
back a person receives from psychological tests can 
change a career path or otherwise alter a person’s 
life course. Of course, not every instance of psy- 
chological testing provides self-knowledge. Per- 
haps in the majority of cases the client already 
knows what the test results divulge. A high-func- 
tioning college student is seldom surprised to find 
that his IQ is in the superior range. An architect is 
not perplexed to hear that she has excellent spatial 
reasoning skills. A student with meager reading ca- 
pacity is usually not startled to receive a diagnosis 
of “learning disability.” 

Another use for psychological tests is the sys- 
tematic evaluation of educational and social pro- 
grams. We have more to say about the evaluation of 
educational programs when we discuss achieve- 
ment tests in a later chapter. We focus here upon the 
use of tests in the evaluation of social programs. 
Social programs are designed to provide services 
that improve social conditions and community life.
For example, Project Head Start is a federally 
funded program that supports nationwide pre- 
school teaching projects for underprivileged chil- 
dren (Cicerelli, 1969; McKey and others, 1985). 
Launched in 1965 as a precedent-setting attempt to 
provide child development programs to low-in- 
come families, Head Start has provided educational 
enrichment and health services to millions of at- 
risk preschool children. 

But exactly what impact does the multi-billion- 
dollar Head Start program have on early childhood 
development? Congress wanted to know if the pro- 
gram improved scholastic performance and reduced 
school failure among the enrollees. But the centers
vary by sponsoring agencies, staff characteristics,
coverage, content, and objectives, so the effects of 
Head Start are not easy to ascertain. Psychological 
tests provide an objective basis for answering these 
questions that is far superior to anecdotal or im- 
pressionistic reporting. In general, Head Start chil- 
dren show immediate gains in IQ, school readiness, 
and academic achievement, but these gains dissi- 
pate in the ensuing years (Figure 2.2). 

So far we have discussed the practical applica- 
tion of psychological tests to everyday problems 
such as job selection, diagnosis, or program evalua- 
tion. In each of these instances, testing serves an im- 
mediate, pragmatic purpose: helping the tester make 
decisions about persons or programs. But tests also 
play a major role in both the applied and theoretical 
branches of behavioral research. As an example of 
testing in applied research, consider the problem 
faced by neuropsychologists who wish to investigate 
the hypothesis that low-level lead absorption causes 
behavioral deficits in children. The only feasible 
way to explore this supposition is by testing normal 
and lead-burdened children with a battery of psy- 
chological tests. Needleman and associates (1979) 
used an array of traditional and innovative tests to 
conclude that low-level lead absorption causes 
decrements in IQ, impairments in reaction time, and 
escalations of undesirable classroom behaviors.
Their conclusions inspired a tumultuous and bitter
exchange of opinions that we will not review here
(Needleman et al., 1990). However, the passions
inspired by this study epitomize an instructive point:
Academicians and public policymakers respect
psychological tests. Why else would they engage in
lengthy, acrimonious debates about the validity of
testing-based research findings?

FIGURE 2.2 Longitudinal Test Results from the Head Start Project

Source: From McKey, R. H., and others. (1985). The impact of
Head Start on children, families and communities. Washington,
DC: U.S. Government Printing Office. In the public domain.

On occasion, tests serve a less-worldly role by 
helping scientists investigate theoretical matters that 
have no immediate or obvious practical applica- 
tions. For example, to analyze perceptual field 
dependence, Witkin (1949) invented the tilting- 
room-tilting-chair tests (TRTC). The apparatus for 
these tests consists of a boxlike room, suspended on 
ball-bearing pivots so that it can be tilted by any 
amount to left or right. Inside the room is a chair for 
the subject, which can also be tilted independent of 
the room. The subject’s task is to bring his or her 
body to a position that is perceived as upright. Field- 
dependent subjects align their bodies somewhat to 
the room rather than the perceived force of gravity. 
Field-independent subjects are less affected by the 
misaligned room, more attuned to their internal per- 
ceptual signals; that is, their perceptual judgments 
are relatively independent of the distorting visual in- 
formation. The TRTC inspired a lifetime of research 
on personality development, but was seldom ap- 
plied to any practical problems of testing. 


WHO MAY OBTAIN TESTS


Test developers, publishers, and psychological ex- 
aminers generally release psychological tests only 
to qualified persons who have a legitimate need to 
study or use these materials. There are three rea- 
sons why access to psychological tests is restricted: 


1. In the hands of unqualified persons, psycholog- 
ical tests can cause harm. 

2. The selection process is rendered invalid for per- 
sons who preview test questions. 

3. Leakage of item content to the general public 
completely destroys the efficacy of a test. 


We examine each of these points in more detail. 

An unqualified examiner may err in the selection,
administration, scoring, or interpretation of
psychological tests, which could cause harm to the 
subject. The possibilities for error and harm are 
almost limitless, so we provide only a typical 
example here. A common mistake among inexpe- 
rienced examiners is the failure to give credit to an 
older subject for the easier, unadministered items 
on a subscale. For example, on a hypothetical 20- 
item subscale of intelligence, the test manual 
might specify that an older subject should en- 
counter only items 11 through 20, on the assump- 
tion that the easier items (1 through 10) would 
surely be answered correctly. Nonetheless, the test 
instructions might specify that the examinee 
should receive point credit for items 1 through 10. 
Failure to tally these points would cause the score 
to be drastically low, with negative consequences 
for the subject. 
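
The scoring rule in this hypothetical example can be written out as a short sketch. It assumes a 20-item subscale worth one point per item, with full credit granted for every item below the starting point; the function name `raw_score` and its parameters are illustrative only, and real test manuals specify their own starting points and credit rules.

```python
def raw_score(responses, start_item, n_items=20):
    """Raw score on a hypothetical subscale with an age-based starting item.

    `responses` maps an item number to 1 (pass) or 0 (fail) for the
    items that were actually administered. Items below `start_item`
    are assumed correct and credited without being administered.
    """
    # Credit for the easier, unadministered items (items 1 .. start_item - 1).
    assumed_credit = start_item - 1
    # Points earned on the items the examinee actually attempted.
    earned = sum(responses.get(item, 0) for item in range(start_item, n_items + 1))
    return assumed_credit + earned
```

For an older subject who starts at item 11 and passes items 11 through 16, `raw_score(responses, start_item=11)` gives 16, whereas an examiner who tallies only the administered items would record 6.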

Another reason for limiting the availability of 
tests is that illicit access to their content undermines 
the effectiveness of selection processes. Put simply,
examinees who have prior access to a selection test 
can learn to produce the desired test results. Finally, 
it should be obvious that leakage of test items to the 
public renders a test completely useless. If individ- 
uals can memorize the answers to test questions, 
their performance on the test will be artificially in- 
flated. To give an extreme example, a blind person 
could pass a color vision test by memorizing the 
correct responses. 


SOURCES OF INFORMATION 
ON TESTS 


A textbook on psychological testing cannot begin 
to survey all the tests of potential interest to read- 
ers. There are simply too many tests! Furthermore, 
dozens of new and useful tests are developed each 
year. The serious student of psychological tests will 
need guidelines and strategies for learning about 
tests, not a static list of recommendations. 

Information about psychological tests is available
from five sources: reference books, publishers’
catalogues, journals, databases, and test manuals.
We cite a few prominent examples from each category,
but we hope the reader will assimilate a strategy
for knowledge acquisition rather than relying
exclusively on these specific citations.

The best single reference source for informa- 
tion on mainstream tests is the Mental Measure- 
ments Yearbook (MMY) published by the Buros 
Institute for Mental Measurement at the University 
of Nebraska. A new edition of the MMY is issued 
periodically (Buros, 1978; Conoley & Impara, 
1995; Conoley & Kramer, 1989, 1992; Mitchell, 
1985; Plake & Impara, 2001). The MMY includes 
critical reviews of tests and a listing of important 
references.

The Test Corporation of America has published
several excellent, although somewhat dated,
reference books on testing. The Test Critiques series
(Volumes I-X), edited by Keyser and Sweetland 
(1984-1994), provides in-depth evaluation of psy- 
chological, educational, and business tests. More 
recently, Maltby, Lewis, and Hill (2001) have as- 
sembled expert reviews of 250 psychological tests 
covering health psychology; social psychology; 
personality and individual differences; develop- 
mental, occupational, educational, and cognitive 
psychology. An excellent source of short, straight- 
forward clinical measures is Measures for Clinical 
Practice: A Sourcebook, by Corcoran and Fischer 
(1994). In this compendium, instruments for adults, 
children, couples, and families are cross-indexed 
by problem area. 

Another way to learn about tests is to request 
catalogues from the major test publishers. Appen- 
dix B (Test Publisher Addresses) lists the names 
and addresses of prominent U.S. publishers and 
distributors of tests. Appendix C (Major Tests and 
Their Publishers) provides a categorized list of no- 
table tests and their publishers. The latest MMY 
contains a more comprehensive directory of pub- 
lishers and tests. 

Many psychological journals publish articles 
on the reliability and validity of better-known tests. 
The best way to locate studies on a specific test is
through PsycINFO, a computerized database of
abstracts from dozens of psychology-relevant journals,
some going back to 1887. If a
user enters the full name of a test as the key search
phrase, a list of relevant articles will appear. Very
recent articles are not in the PsycINFO database,
so it is always worthwhile to peruse the latest
issues of relevant journals such as the following:


Applied Neuropsychology
Assessment
Advances in Personality Assessment
The Clinical Neuropsychologist
European Journal of Psychological Assessment
Journal of Learning Disabilities
The Journal of Psychoeducational Assessment
Psychology in the Schools
The Journal of School Psychology
Educational and Psychological Measurement
Psychological Assessment
The Journal of Clinical and Experimental Neuropsychology
The Journal of Clinical Psychology
The Journal of Personality Assessment


This is a partial list—many other journals publish 
occasional test reviews. 

Finally, an important and often overlooked 
source of information about any specific test is its 
manual. A good manual contains essential infor- 
mation about norms, standardization, administra- 
tion, reliability, and validity. 


SUMMARY 


1. A test can be defined as a standardized pro- 
cedure for sampling behavior and describing it with 
categories or scores. In addition, most tests have 
norms or standards by which the results can be used 
to predict other, more important behaviors. 


2. Tests always constitute a sample of behav- 
ior, never the totality of that which the examiner 
seeks to measure. For this reason, test results al- 
ways incorporate some degree of measurement 
error. 


3. In a norm-referenced test, the examinee’s 
test score is interpreted in relation to scores ob- 
tained by others on the same test. In a criterion-ref- 
erenced test, the emphasis is on what the examinee 
can do with respect to very tightly defined educa- 
tional criteria. 


4. Assessment is the process of compiling in- 
formation about a person and using it to make 
inferences about characteristics and to predict be- 
havior. Assessment incorporates testing but is more 
comprehensive and may include observations, in- 
terviews, and other sources of information. 


5. Group tests are pencil-and-paper measures 
suitable to testing large groups of persons at one 
time. Individual tests are designed for one-on-one 
administration; the examiner can thereby observe 
motivation and other characteristics of the examinee. 


6. An arbitrary but useful classification of 
psychological tests is as follows: intelligence, apti- 
tude, achievement, creativity, personality, interest, 
behavioral, and neuropsychological. The charac- 
teristics of these tests are outlined in Table 2.1. 


7. Five uses of tests may be distinguished: 
classification, diagnosis and treatment planning, 
self-knowledge, program evaluation, and research. 


8. Classification can be further broken down 
into placement, the sorting of persons into appro- 
priate programs; screening, quick identification of 
persons with special characteristics or needs; and 
certification (e.g., for a driver’s license) and selec- 
tion (e.g., for college). 


9. Access to psychological tests is strictly 
controlled so that only persons with appropriate 
training may gain access to them. Many test pub- 
lishers divide tests into three levels of complexity 
that require increasing degrees of expertise for their 
application. 


10. Sources of information about tests include 
the Mental Measurements Yearbook series and the 
Test Critiques volumes. Some journals such as As- 
sessment and The Journal of Psychoeducational 
Assessment also feature information about psycho- 
logical tests. 


KEY TERMS AND CONCEPTS


psychometrician
test
standardized procedure
norms
standardization sample
norm-referenced test
criterion-referenced test
assessment
group tests
individual tests
intelligence tests
aptitude tests
achievement tests
creativity tests
creativity
personality tests
interest inventories
behavioral procedures
neuropsychological tests
classification
placement
screening
certification
diagnosis
learning disability


Topic 2B The Testing Process 


Standardized Procedures in Test Administration 

Case Exhibit 2.1 The Impact of Nonstandard Testing 
Desirable Procedures of Test Administration 

Influence of the Examiner 
Background and Motivation of the Examinee 


Issues in Scoring 
Summary 


Key Terms and Concepts 


Psychological testing is a dynamic process
influenced by many factors. Although examiners
strive to ensure that test results accurately
reflect the traits or capacities being assessed, many 
extraneous factors can sway the outcome of psy- 
chological testing. In this section, we review the 
potentially crucial impact of several sources of in- 
fluence: the manner of administration, the charac- 
teristics of the tester, the context of the testing, the 
motivation and experience of the examinee, and the 
method of scoring. 

The sensitivity of the testing process to extrane- 
ous influences is obvious in those rare, egregious 
spectacles of nonstandard testing that are reported 
from time to time (Case Exhibit 2.1). However, in- 
valid test results do not originate only from obvious 
sources such as blatantly nonstandard administra- 
tion, hostile tester, noisy testing room, scared ex- 
aminee, or careless scoring. In addition, there are 
numerous, subtle ways in which method, examiner, 
context, motivation, or scoring can alter test results. 
We provide a comprehensive survey of these extra- 
neous influences in the remainder of this topic. 


STANDARDIZED PROCEDURES 
IN TEST ADMINISTRATION 


The interpretation of a psychological test is most 
reliable when the measurements are obtained under 
the standardized conditions outlined in the
publisher’s test manual. Nonstandard testing
procedures can alter the meaning of the test results,
rendering them invalid and therefore misleading. 
Standardized procedures are so important that they 
are listed as an essential criterion for valid testing in 
the Standards for Educational and Psychological 
Testing (1985, 1999), a reference manual published
jointly by the American Psychological Association 
and other groups: 


In typical applications, test administrators should 
follow carefully the standardized procedures for 
administration and scoring specified by the test 
publisher. Specifications regarding instructions to 
test takers, time limits, the form of item presenta- 
tion or response, and test materials or equipment 
should be strictly observed. Exceptions should be 
made only on the basis of carefully considered pro- 
fessional judgment, primarily in clinical applica- 
tions. (AERA, APA, NCME, 1985) 


Suppose the instructions to the vocabulary sec- 
tion of a children’s intelligence test specify that the 
examiner should ask, “What does sofa mean, what 
is a sofa?” If a subject were to reply, “I’ve never 
heard that word,” an inexperienced tester might be 
tempted to respond, “You know, a couch—what is 
a couch?” This may strike the reader as a harmless 
form of fair play, a simple rephrasing of the origi- 
nal question. Yet, by straying from standardized 
procedures, the examiner has really given a different
test. The point in asking for a definition of sofa
(and not couch) is precisely that sofa is harder to
define and therefore a better index of
vocabulary skills.

Even though standardized testing procedures
are normally essential, there are instances in which 
flexibility in procedures is desirable or even neces- 
sary. As suggested in the APA Standards, such de- 
viations should be reasoned and deliberate. An 
analogy to the spirit of the law versus the letter of 
the law is relevant here. An overly zealous exam- 
iner might capture the letter of the law, so to speak, 
by adhering literally and strictly to testing
procedures outlined in the publisher’s manual. But is this
really what most test publishers intend? Is it even
how the test was actually administered to the nor- 
mative sample? Most likely publishers would pre- 
fer that examiners capture the spirit of the law even 
if, on occasion, it is necessary to adjust testing pro- 
cedures slightly. 

Consider the following situation, which arose 
when a psychologist administered a standardized 
intelligence test to an anxious and overly concrete 
college student. When asked, “How much is four 
dollars and five dollars?” the student replied, “Four 
dollars is four dollars and five dollars is five dol- 
lars.” A literal interpretation of the test manual 
would require that the examiner record zero credit 
and proceed to the next item. However, the question
was intended to test arithmetical skills, not
concreteness of thinking. Thus, the examiner asked the
question again with a slight change in emphasis: 
“How much is four dollars and five dollars?” The 
subject guffawed loudly and answered immedi- 
ately, “Nine dollars—I didn’t realize it was an arith- 
metic question.” 

The need to adjust standardized procedures for 
testing is especially apparent when examining per- 
sons with certain kinds of disabilities. A subject 
with a speech impediment might be allowed to 
write down the answers to orally presented ques- 
tions or to use gesture and pantomime in response 
to some items. For example, a test question might 
ask, “What shape is a ball?” The question is
designed to probe the subject’s knowledge of
common shapes, not to examine whether the examinee
can verbalize “round.” The written response round
and the gestured response (a circular motion of the
index finger) are equally correct.

Minor adjustments in procedures that heed the 
spirit in which a test was developed occur on a 
regular basis and are no cause for alarm. These 
minor adjustments do not invalidate the established 
norms—on the contrary, the appropriate adaptation 
of procedures is necessary so that the norms remain 
valid. After all, the testers who collected data from 
the standardization sample did not act like heartless 
robots when posing questions to subjects. Examin- 
ers who wish to obtain valid results must likewise 
exercise a reasoned flexibility in testing procedures. 

However, considerable clinical experience is 
needed to determine whether an adjustment in pro- 
cedure is minor or so substantial that existing norms 
no longer apply. This is why psychological exam- 
iners normally receive extensive supervised experi- 
ence before they are allowed to administer and 
interpret individual tests of ability or personality. 

In certain cases an examiner will knowingly de- 
part from standard procedures to a substantial de- 
gree; this practice precludes the use of available test 
norms. In these instances, the test is used to help 
formulate clinical judgments rather than to de- 
termine a quantitative index. For example, when 
examining aphasic patients, it may be desirable to 
ignore time limits entirely and accept roundabout 
answers (Eisenson, 1954). The examiner might not
even calculate a score. In these rare cases, the test 
becomes, in effect, an adjunct to the clinical inter- 
view. Of course, when the examiner does not ad- 
here to standardized procedures, this should be 
stated explicitly in the written report.






DESIRABLE PROCEDURES
OF TEST ADMINISTRATION


A small treatise could be written on desirable pro- 
cedures of test administration, but we will have to 
settle for a brief listing of the most essential points. 
For more details, the interested reader can consult 
Sattler (2001) on the individual testing of children 
and Clemans (1971) on group testing. We discuss 
individual testing first, then briefly list some im- 
portant points about desirable procedures in group 
testing. 

An essential component of individual testing is 
that examiners must be intimately familiar with the 
materials and directions before administration be- 
gins. Largely this involves extensive rehearsal and 
anticipation of unusual circumstances and the ap- 
propriate response. A well-prepared examiner has 
memorized key elements of verbal instructions and 
is ready to handle the unexpected. 

The uninitiated student of assessment often as- 
sumes that examination procedures are so simple 
and straightforward that a quick once-through read- 
ing of the manual will suffice as preparation for test- 
ing. Although some individual tests are exceedingly 
rudimentary and uncomplicated, many of them 
have complexities of administration that, unheeded, 
can cause the examinee to fail items unnecessarily. 
For example, Choi and Proctor (1994) found that 25 
of 27 graduate students made serious errors in the 
administration of the Stanford-Binet: Fourth Edi- 
tion, even though the sessions were videotaped and 
the students knew their testing skills were being 
evaluated. Appropriate attention to the details of ad- 
ministration is essential for valid results. 

The necessity for intimate familiarity with test- 
ing procedures is well illustrated by the Block 
Design subtest of the WAIS-III (Wechsler, 1997). 
The materials for the subtest include nine blocks 
(cubes) colored red on two sides, white on two
sides, and red/white on two sides. The examinee’s 
task is to use the blocks to construct patterns de- 
picted on cards. For the initial designs, four blocks 
are needed, while for more difficult designs, all 
nine blocks are provided (Figure 2.3). 

Bright examinees have no difficulty compre- 
hending this task and the exact instructions do not 
influence their performance appreciably. However, 
persons whose intelligence is average or below av- 
erage need the elaborate demonstrations and cor- 
rections that are specified in the WAIS-III manual 
(Wechsler, 1997). In particular, the examiner 
demonstrates the first two designs and responds to 
the examinee’s success or failure on these accord- 
ing to a complex flow of reaction and counterreac- 
tion, as outlined in three pages of instructions. Woe 
to the tester who has not rehearsed this subtest and 
anticipated the proper response to examinees who 
falter on the first two designs. 


Sensitivity to Disabilities 


Another important ingredient of valid test adminis- 
tration is sensitivity to disabilities in the examinee. 
Impairments in hearing, vision, speech, or motor
control may seriously distort test results. If the
examiner does not recognize the physical disability
responsible for the poor test performance, a subject
may be branded as intellectually or emotionally
impaired when, in fact, the essential problem is a
sensory or motor disability.

FIGURE 2.3 Materials Similar to WAIS-III Block Design Subtest

Vernon and Brown (1964) reported the tragic 
case of a young girl who was relegated to a hospi- 
tal for the mentally retarded as a consequence of 
the tester’s insensitivity to physical disability. The 
examiner failed to notice that the child was deaf and 
concluded that her Stanford-Binet IQ of 29 was 
valid. She remained in the hospital for five years, 
but was released after she scored an IQ of 113 on a
performance-based intelligence test! After dis- 
missal from the hospital, she entered a school for 
the deaf and made good progress. 

Persons with disabilities may require special- 
ized tests for valid assessment. The reader will en- 
counter a lengthy discussion of available tests for 
exceptional examinees in Topic 7A (Testing Spe- 
cial Populations). In this section, we concentrate on 
the vexing issues raised when standardized tests for 
normal populations are used with mildly or moder- 
ately disabled subjects. We include separate dis- 
cussions of the testing process for examinees with 
a hearing, vision, speech, or motor control problem. 
However, the reader needs to know that many ex- 
ceptional examinees have multiple disabilities 
(Tweedie & Shroyer, 1982). 

Valid testing of a subject with a hearing impair- 
ment requires first of all that the examiner detect 
the existence of the disability! This is often more 
difficult than it seems. Many persons with mild 
hearing loss learn to compensate for this disability 
by pretending to understand what others say and 
waiting for further conversational cues to help clar- 
ify faintly perceived words or phrases. As a result, 
other persons—including psychologists—may not 
perceive that an individual with mild hearing loss 
has any disability at all. 

Failure to notice a hearing loss is particularly a 
problem with young examinees, who are usually 
poor informants about their disabilities. Young chil- 
dren are also prone to fluctuating hearing losses 
due to the periodic accumulation of fluid in the
middle ear during intervals of mild illness (Vernon
& Alles, 1986). A child with a fluctuating hearing 
loss may have normal hearing in the morning, but 
perceive conversational speech as a whisper just a 
few hours later. 

Indications of possible hearing difficulty in- 
clude lack of normal response to sound, inattentive- 
ness, difficulty in following oral instructions, intent 
observation of the speaker’s lips, and poor
articulation (Sattler, 1988). In all cases in which hearing
impairment is suspected, referral for an audiological
examination is crucial. If a serious hearing problem 
is confirmed, then the examiner should consider 
using one of the specialized tests discussed in Topic 
7A, Testing Special Populations. In persons with a 
mild hearing loss, it is essential for the examiner to 
face the subject squarely, speak loudly, and repeat 
instructions slowly. It is also important to find a 
quiet room for testing. Ideally, a testing room will 
have curtains and textured wall surfaces to minimize 
the distracting effects of background noises. 

In contrast to those with hearing loss, subjects 
with visual disabilities generally attend well to ver- 
bally presented test materials. The examinee with 
visual impairment introduces a different kind of 
challenge to the examiner: detecting that a visual 
impairment exists, and then ensuring that the sub- 
ject can see the test materials well. 

Detecting visual impairment is a straightfor- 
ward matter with adult subjects—in most cases, a 
mature examinee will freely volunteer information 
about visual impairment, especially if asked. How- 
ever, children are poor informants about their vi- 
sual capacities, so testers need to know the signs 
and symptoms of possible visual impairment in a 
young examinee. Common sense is a good starting 
point: Children who squint, blink excessively, or 
lose their place when reading may have a vision 
problem. Holding books or testing materials up 
close is another suspicious sign. Blurred or double 
vision may signify visual problems, as may head-
aches or nausea after reading. In general, it is so
common for children to require corrective lenses 
that examiners should be on the lookout for a vision 
problem in any young subject who does not wear 
glasses and has not had a recent vision exam.

Depending upon the degree of visual impair-
ment, examiners need to make corresponding ad-
justments in testing. If the child’s vision is of no
practical use, special instruments with appropriate 
norms must be used. For example, the Perkins-Binet 
is available for testing children who are blind. These 
tests are discussed in Topic 7A (Testing Special 
Populations). For obvious reasons, only the verbal 
portions of tests should be administered to sighted 
children with an uncorrected visual problem. 

Speech impairments present another problem 
for diagnosticians. The verbal responses of subjects 
with speech impairment are difficult to decipher. 
Owing to the failed comprehension of the exam- 
iner, subjects may receive less credit than is due. 
Sattler (1988) relates the lamentable case of Daniel 
Hoffman, a youngster with speech impairment who 
spent his entire youth in classes for those with men- 
tal retardation because his Stanford-Binet IQ was 
74. In actuality, his intelligence was within the nor- 
mal range, as revealed by other performance-based 
tests. In another tragic miscarriage of assessment, 
a patient in England was mistakenly confined to a 
ward for those with severe retardation because 
cerebral palsy rendered his speech incomprehen- 
sible. The patient was wheelchair-bound and had 
almost no motor control, so his performance on 
nonverbal tests was also grossly impaired. The staff 
assumed he was severely retarded, so the patient 
remained on the back ward for decades. However, 
he befriended a fellow resident who could compre-
hend the patient’s guttural rendition of the alpha-
bet. The friend was severely retarded but could
nonetheless recognize keys on a typewriter. With 
laborious letter-by-letter effort, the patient with in- 
capacitating cerebral palsy wrote and published an 
autobiography, using his friend with mental dis-
ability as a conduit to the real world.

Even if their disability is mild, persons with 
cerebral palsy or other motor impairments may be 
penalized by timed performance tests. When test- 
ing a person with a mild motor disability, examin- 
ers may wish to omit timed performance subtests, 
or to discount these results if they are consistently 
lower than scores from untimed subtests. If a sub- 
ject has an obvious motor disability—such as a 
difficulty in manipulating the pieces of a puzzle—
then standard instruments administered in the nor- 
mal manner are largely inappropriate. A number of 
alternative instruments have been developed ex- 
pressly for examinees with cerebral palsy and other 
motor impairments, and standard tests have been 
cleverly adapted and renormed (Topic 7A, Testing 
Special Populations). 


Desirable Procedures of Group Testing 


Psychologists and educators commonly assume 
that almost any adult can accurately administer 
group tests, so long as he or she has the requisite 
manual. Administering a group test would appear 
to be a simple and straightforward procedure of 
passing out forms and pencils, reading instructions, 
keeping time, and collecting the materials. 

In reality, conducting a group test requires as 
much finesse as administering an individual test, a 
point recognized years ago by Traxler (1951): 


It is doubtful if any educational institution would 
trust the administration and scoring of the individ- 
ual Stanford-Binet Scale to anyone other than a 
trained psychometrician, but there is a rather gen- 
eral impression among school people that almost 
anyone can administer and score a group test if 
only he has a manual at hand. As a matter of fact, 
however, the administration of a group test to a 
class of pupils in one sense is a far more exacting 
procedure than the giving of an individual test to 
one pupil. The routine is more rigid and the penalty 
for error is multiplied by the number of individuals 
in the group. In the administration of an individual 
test a certain amount of leeway may be allowed in 
order to create a situation favorable for the eliciting 
of responses from that particular individual, but 
when testing a group, complete fidelity to all de- 
tails of the prescribed procedure is imperative if the 
examiner is to avoid wasting the time of many indi- 
viduals and if the results are to have the same 
meaning for all. 


There are numerous ways in which careless ad- 
ministration and scoring can impair group test re- 
sults, causing bias for the entire group or affecting 
only certain individuals. We outline only the more 
important inadequacies and errors in the following
paragraphs, referring the reader to Traxler (1951)
and Clemans (1971) for a more complete discussion. 

Undoubtedly the greatest single source of error 
in group test administration is incorrect timing of 
tests that require a time limit. Examiners must allot 
sufficient time for the entire testing process: setup, 
reading instructions out loud, and the actual test 
taking by examinees. Allotting sufficient time re-
quires foresightful scheduling. For example, in
many school settings, children must proceed to the
next class at a designated time, regardless of ongo-
ing activities. Inexperienced examiners might be
tempted to cut short the designated time limit for a 
test so that the school schedule can be maintained. 
Of course, reduced time on a test renders the norms 
completely invalid and likely lowers the score for 
most subjects in the group. 

Allowing too much time for a test can be an 
equally egregious error. For example, consider the 
impact of receiving extra time on the Miller Analo- 
gies Test (MAT), a high-level reasoning test once 
required by many universities for graduate school 
application. Since the MAT is a speeded test that 
requires quick analogical thinking, extra time 
would allow most examinees to solve several extra 
problems. This kind of testing error would likely 
lower the validity of the MAT results as a predictor
of graduate school performance. 

A second source of error in group test adminis- 
tration is lack of clarity in the directions to the ex- 
aminees. Examiners must read the instructions 
slowly in a clear, loud voice that commands the at- 
tention of the subjects. Instructions must not be 
paraphrased. Where allowed by the manual, exam- 
iners must stop and clarify points with individual 
examinees who are confused. 

Variation in the physical conditions under
which tests are given is a third source of potential
error in group test administration. Examiners must
ensure that the testing room is well illuminated and, 
if needed, heated or air-conditioned to control ex- 
treme variations in temperature and humidity. Cle- 
mans (1971) has noted that test authors seldom go 
into detailed specifications concerning illumina- 
tion, temperature, and humidity, since examiner 
and subjects, with few exceptions, will have to put
up with the conditions that exist. Nonetheless, it is
obvious that examinees cannot perform optimally 
if tested in a dimly lit room that is too cold or op- 
pressively hot and humid. Foresightful test admin- 
istrators should do their examinees a favor by 
scheduling important group tests in a pleasant and 
well-illuminated environment. 

The quality of the writing surface can be cru- 
cially important for valid group testing, especially 
for young subjects. Traxler’s (1951) point that 
schools vary widely in their facilities for the ad- 
ministration of group tests is valid even today: 

In the matter of writing space alone, some schools
use large, comfortable tables, others use desks, oth-
ers armchairs, and still others give their tests in the
auditorium with each pupil writing on a portable
beaverboard “desk,” or even on his lap. It is not
reasonable to expect fully comparable results under
such varying conditions.


The importance of the writing surface is magnified 
by the current tendency to use separate answer 
sheets. Subjects need a wider desk space than oth- 
erwise when employing separate answer sheets. Al- 
though few test publishers do so, it would be wise 
to specify in test manuals the permissible variations 
in writing surfaces that still allow for comparable 
test results. 

Noise is another factor that must be controlled 
in group testing. It has been known for some time 
that noise causes a decrease in performance, espe- 
cially for tasks of high complexity (e.g., Boggs & 
Simon, 1968). Surprisingly, there is little research 
on the effects of noise on psychological tests. How- 
ever, it seems almost certain that loud noise, espe- 
cially if intermittent and unpredictable, will cause 
test scores to decline substantially. Elementary 
schoolchildren should not be expected to perform 
well while a construction worker jackhammers a 
cement wall in the next room. In fairness to the ex- 
aminees, there are times when the test administra- 
tor should reschedule the test. 

A fourth source of error in the administration of 
a group test is failure to explain when and if exam- 
inees should guess. Perhaps more frequently than 
any other question, examiners are asked, “Is there a 
penalty if I guess wrong?” In most instances, test 
developers anticipate this issue and provide explicit
guidance to subjects as to the advantages and/or pit- 
falls of guessing. Examiners should not give supple- 
mentary advice on guessing—this would constitute 
a serious deviation from standardized procedure. 

Most test developers incorporate a correction 
for guessing based on established principles of 
probability. Consider a multiple-choice test that has 
four alternatives per item. On those items for which the
subject makes a wild, uneducated guess, the chances
of being correct are 1 out of 4, while the chances of
being wrong are 3 out of 4. Thus, for every three
wrong guesses, there will be one correct guess that 
reflects luck rather than knowledge. Suppose a 
young girl answers correctly on 35 questions from 
a 50-item test but answers erroneously on 9 ques- 
tions. In all she has answered 44 questions, leaving 
6 blank. The fact that she selected the wrong alter- 
native on 9 questions suggests that she also gained 
3 correct answers due to luck rather than knowl- 
edge. Remember, on wild guesses we expect there 
to be, on average, 3 wrong answers for every cor- 
rect answer, so for 9 wrong guesses we would ex- 
pect 3 correct guesses on other questions. The 
subject’s corrected score—the one actually re- 
ported and compared to existing norms—would 
then be 32; that is, 35 minus 3. In other words, she 
probably knew 32 answers but by guessing on 12 
others she boosted her score another 3 points. 
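
The arithmetic in this example can be sketched in a few lines of Python. The function name and its parameters are hypothetical, chosen only for illustration, and the rule shown (subtract the number of wrong answers divided by one less than the number of alternatives) is the general form implied by the four-alternative case described above, not a procedure taken from any particular test manual.

```python
from fractions import Fraction


def corrected_score(num_right, num_wrong, num_choices):
    """Corrected score: right answers minus the expected number of lucky guesses.

    Blank items are ignored. With k alternatives per item, each group of
    (k - 1) wrong guesses is assumed to accompany one lucky correct guess.
    """
    if num_choices < 2:
        raise ValueError("an item needs at least two alternatives")
    expected_lucky = Fraction(num_wrong, num_choices - 1)
    return num_right - expected_lucky


# The example from the text: 35 right, 9 wrong, 6 blank, 4 alternatives.
score = corrected_score(num_right=35, num_wrong=9, num_choices=4)
print(score)  # 32
```

Exact fractions are used so that a wrong count that is not a multiple of (k - 1) yields an exact fractional correction instead of a rounded one; how a real testing program rounds such values is not specified here.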

The scoring correction outlined in the preced- 
ing paragraph pertains only to wild, uneducated 
guesses. The effect of such a correction is to elim- 
inate the advantage otherwise bestowed on un- 
abashed risk takers. However, not all guesses are 
wild and uneducated. In some instances, an exam- 
inee can eliminate one or two of the alternatives, 
thereby increasing the odds of a correct guess 
among the remaining choices. In this situation, it 
may be wise for the examinee to guess. 
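
Why an educated guess can pay off under this kind of correction is easier to see with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the correct answer is among the alternatives the examinee has not eliminated and that each remaining alternative is equally likely to be correct.

```python
from fractions import Fraction


def expected_gain_from_guess(num_choices, num_remaining):
    """Expected change in corrected score from guessing on one item.

    A correct guess earns 1 point; a wrong guess costs 1 / (num_choices - 1).
    Assumes the right answer is among the remaining alternatives and that
    each of them is equally likely to be right.
    """
    if num_choices < 2 or not 1 <= num_remaining <= num_choices:
        raise ValueError("invalid number of alternatives")
    p_right = Fraction(1, num_remaining)
    penalty = Fraction(1, num_choices - 1)
    return p_right - (1 - p_right) * penalty


# Four-alternative items: wild guess, one eliminated, two eliminated.
for remaining in (4, 3, 2):
    print(remaining, expected_gain_from_guess(4, remaining))
```

Under these assumptions a wild guess among all four alternatives has an expected gain of zero, whereas eliminating one or two alternatives yields an expected gain of 1/9 or 1/3 of a point per item. The equal-likelihood assumption fails whenever the wrong alternatives are written to be especially attractive.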

Whether an educated guess is really to the ad- 
vantage of the examinee depends partly on the dia- 
bolical skill of the item writer. Traxler (1951) notes: 


In effect, the item writer attempts to make each 
wrong response so plausible that every examinee 
who does not possess the desired skill or ability 
will select a wrong response. In other words, the
item writer’s aim is to make all or nearly all con-
sidered guesses wrong guesses.


A skilled item writer can fashion questions so that 
the correct alternative is completely counterintu- 
itive and the wrong alternatives are persuasively ap- 
pealing. For these items, an educated guess is 
almost always wrong. 

Nonetheless, many test developers now advise 
subjects to make educated guesses, but warn 
against wild guesses. For example, a recent edi- 
tion of the test preparation manual Taking the SAT 
advises: 


Because of the way the test is scored, haphazard or 
random guessing for questions you know nothing 
about is unlikely to change your score. When you 
know that one or more choices can be eliminated, 
guessing from among the remaining choices should 
be to your advantage. 


Whether or not a group test uses a scoring correction, 
the important point to emphasize in this context is 
that the administrator should follow standardized 
procedure and never offer supplementary advice 
about guessing. In group testing, deviations from the 
instructions manual are simply unacceptable. 


INFLUENCE OF THE EXAMINER

The Importance of Rapport


Test publishers urge examiners to establish 
rapport—a comfortable, warm atmosphere that 
serves to motivate examinees and elicit coopera- 
tion. Initiating a cordial testing milieu is a crucial 
aspect of valid testing. A tester who fails to estab- 
lish rapport may cause a subject to react with 
anxiety, passive-aggressive noncooperation, or 
open hostility. Failure to establish rapport distorts 
test findings: Ability is underestimated and person- 
ality is misjudged. 

Rapport is especially important in individual 
testing and particularly so when evaluating chil- 
dren. Wechsler (1974) has noted that establishing 
rapport places great demands on the clinical skills 
of the tester: 


He must put the child at ease, keep him interested 
in the tasks at hand, and encourage him to do his 
best. There is no magic formula for “reaching” a 
child; approaches that succeed with some children 
may antagonize others. With experience, the exam- 
iner will develop a perceptiveness enabling him to 
establish sympathetic relations with children and to 
adapt to the specific needs of each one. The general 
suggestions below are offered to aid the examiner 
in this endeavor. 

To put the child at ease in his surroundings, the 
examiner might engage him in some informal con- 
versation before getting down to the more serious 
business of giving the test. Talking to him about his 
hobbies or interests is often a good way of breaking 
the ice, although it may be better to encourage a 
shy child to talk about something concrete in the 
environment—a picture on the wall, an animal in 
his classroom, or a book or toy (not a test material) 
in the examining room. In general, this introductory 
period need not take more than 5 to 10 minutes, al- 
though the testing should not start until the child 
seems relaxed enough to give his maximum effort. 


A study by Gregory, Lehman, and Mohan 
(1976) illustrates the importance of establishing 
rapport when testing children. These researchers 
sought to determine the effects of low-level lead ex- 
posure on IQ by administering the Wechsler Intel- 
ligence Scale for Children (WISC) to 193 children 
living near a lead smelter. Children were assigned 
to five different graduate-student testers on a quasi- 
random, first-come, first-served rotational basis. 
The groups of children tested by each of the five 
psychometricians did not differ in average age, lead 
exposure, or social class. Moreover, the sample 
sizes were substantial, ranging from 30 to 45. 
Hence, the average tested IQs of the five groups 
should have been highly similar. 

However, the differences between tested IQs of 
the five groups were distressingly large, with aver- 
age scores varying by as much as 14 points. Ranked 
from low to high, the average scores for the five 
groups were 90, 94, 95, 96, and 104. The tester 
whose subjects tested at an average IQ of 90 was 
very formal, precise, cold, and hurried. In fact, he 
tested the most subjects by far (45, compared to 37 
for the next-most prolific tester) and was usually
finished with a child much sooner. At the other ex-
treme was the tester whose subjects obtained an av-
erage IQ of 104. He went beyond good rapport to
offer support and encouragement that bordered on 
leading the subjects to the correct answer. For ex- 
ample, on Block Design he urged one child, “Come 
on, get the blocks in the corners and go forward 
from there.” 

Testers, then, may differ in their abilities to es- 
tablish rapport. Cold testers will likely obtain less 
cooperation from their subjects, resulting in re- 
duced performance on ability tests or distorted, de- 
fensive results on personality tests. Overly 
solicitous testers may err in the opposite direction, 
giving subtle (and occasionally blatant) cues to cor- 
rect answers. Both extremes should be avoided. 


Examiner Sex, Experience, and Race 


A wide body of research has sought to determine 
whether certain characteristics of the examiner 
cause examinee scores to be raised or lowered on 
ability tests. For example, does it matter whether 
the examiner is male or female? Experienced or 
novice? Same or different race from the examinee? 
We will contain the urge to review these studies— 
with a few exceptions—for one simple reason: The
results are contradictory and therefore inconclu-
sive. Most studies find that sex, experience, and
race of the examiner make little, if any, difference. 
Furthermore, the few studies that report a large ef- 
fect in one direction (e.g., female examiners elicit 
higher IQ scores) are contradicted by other studies 
showing the opposite trend. The interested reader 
can consult Sattler (1988) for a discussion and ex- 
tensive listing of references. 

Yet, it would be unwise to conclude that sex, ex-
perience, or race of the examiner never affects
test scores. In isolated instances, a particular exam- 
iner characteristic might very well have a large ef- 
fect on examinee test scores. For example, Terrell, 
Terrell, and Taylor (1981) ingeniously demon-
strated that the race of the examiner interacts po-
tently with the trust level of African American
examinees in IQ testing. These researchers identi-
fied African American college students with high
and low levels of mistrust of whites; half of each 
group was then administered the WAIS by a white 
examiner, the other half by an African American ex- 
aminer. The high-mistrust group with an African 
American examiner scored significantly higher 
than the high-mistrust group with a white examiner 
(average IQs of 96 versus 86, respectively). In ad- 
dition, the low-mistrust group with a white exam- 
iner scored slightly higher than the low-mistrust 
group with an African American examiner (average 
IQs of 97 versus 92, respectively). In sum, the au- 
thors concluded that mistrustful African Americans 
do poorly when tested by white examiners. Data 
bearing on this type of racial effect are meager, and 
there is certainly room for additional research. 
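
The interaction reported in this study can be made concrete by comparing the examiner effect within each mistrust group. The sketch below uses only the four group means quoted above; the variable and function names are hypothetical, and the calculation is purely descriptive (it says nothing about statistical significance).

```python
# Average IQs quoted in the text, keyed by (mistrust level, examiner race).
mean_iq = {
    ("high", "African American"): 96,
    ("high", "white"): 86,
    ("low", "African American"): 92,
    ("low", "white"): 97,
}


def examiner_effect(mistrust):
    """Mean IQ with an African American examiner minus mean IQ with a white examiner."""
    return mean_iq[(mistrust, "African American")] - mean_iq[(mistrust, "white")]


high_effect = examiner_effect("high")    # 96 - 86
low_effect = examiner_effect("low")      # 92 - 97
interaction = high_effect - low_effect   # difference between the two differences

print(high_effect, low_effect, interaction)  # 10 -5 15
```

The examiner effect reverses sign between the two groups, which is what it means to say that examiner race interacts with trust level.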


BACKGROUND AND MOTIVATION 
OF THE EXAMINEE 


Examinees differ not only in the characteristics that 
examiners desire to assess, but also in other extra- 
neous ways that might confound the test results. 
For example, a bright subject might perform poorly
on a speeded ability test because of test anxiety; a 
sane murderer might seek to appear mentally ill on 
a personality inventory to avoid prosecution; a stu- 
dent of average ability might undergo coaching to 
perform better on an aptitude test. Some subjects 
utterly lack motivation and don’t care if they do 
well on psychological tests. In all of these in- 
stances, the test results may be inaccurate because 
of the filtering and distorting effects of certain ex- 
aminee characteristics such as anxiety, malinger- 
ing, coaching, or cultural background. 


Test Anxiety 


Test anxiety refers to those phenomenological, 
physiological, and behavioral responses that ac- 
company concern about possible failure on a test. 
There is no doubt that subjects experience different 
levels of test anxiety ranging from a carefree out- 
look to incapacitating dread at the prospect of being 
tested.

Several true-false questionnaires have been
developed to assess individual differences in test 
anxiety (e.g., Sarason, 1980; Morris, Davis, & 
Hutchings, 1981). Following, we list characteristic 
items and their direction of keying (T for True, F 
for False):


(T) When taking an important examination I 
sweat a great deal. 

(T) I freeze up when I take intelligence tests 
or school exams. 

(F) I really don’t understand why some peo- 
ple get so upset about tests. 

(T) I dread courses in which the instructor 
likes to give “pop” quizzes. 
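
Scoring a keyed true-false questionnaire of this kind amounts to counting the responses that match the keyed direction. The following sketch assumes a four-item scale keyed as in the list above; the function name and the sample responses are hypothetical, and published test anxiety scales are longer and have their own scoring rules.

```python
# Keyed direction for the four illustrative items listed above.
KEY = ["T", "T", "F", "T"]


def anxiety_score(responses, key=KEY):
    """Count the responses that agree with the keyed (anxious) direction."""
    if len(responses) != len(key):
        raise ValueError("one response is required per item")
    return sum(1 for response, keyed in zip(responses, key) if response == keyed)


# A hypothetical examinee who answers True, False, False, True.
print(anxiety_score(["T", "F", "F", "T"]))  # 3
```

Note that the third item is keyed False, so a False response to it adds to the anxiety score.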


An extensive body of research has confirmed 
the commonsense notion that test anxiety is nega- 
tively correlated with school achievement, aptitude 
test scores, and measures of intelligence (Naveh- 
Benjamin, McKeachie, & Lin, 1987; McKeachie, 
1984). However, the interpretation of these corre- 
lational findings is not straightforward. One possi- 
bility is that students develop test anxiety because 
of a history of performing poorly on tests. That 
is, the decrements in performance may precede 
and cause the test anxiety. In support of this view- 
point, Paulman and Kennelly (1984) found that— 
independent of their anxiety—many test-anxious 
students also display ineffective test taking in aca- 
demic settings. Such students would do poorly on 
tests whether or not they were anxious. Moreover, 
Naveh-Benjamin et al. (1987) determined that a 
large proportion of test-anxious college students 
have poor study habits that predispose them to poor 
test performance. The test anxiety of these subjects 
is partly a by-product of lifelong frustration over 
mediocre test results. 

Other lines of research indicate that test anxi- 
ety has a directly detrimental effect on test perfor- 
mance. That is, test anxiety is likely both cause and 
effect in the equation linking it with poor test 
performance. Consider the seminal study on this 
topic by Sarason (1961), who tested high- and low- 
anxious subjects under neutral or anxiety-inducing 
instructions. The subjects were college students re-
quired to memorize two-syllable words low in
meaningfulness—a difficult task. Half of the sub- 
jects performed under neutral instructions—they 
were simply told to memorize the lists. The re- 
maining subjects were told to memorize the lists 
and told that the task was an intelligence test. They 
were urged to perform as well as possible. The two 
groups did not differ significantly in performance 
when the instructions were neutral and nonthreat- 
ening. However, when the instructions aroused 
anxiety, performance levels for the high-anxious 
subjects dropped markedly, leaving them at a huge 
disadvantage compared to low-anxious subjects. 
This indicates that test-anxious subjects show sig- 
nificant decrements in performance when they per- 
ceive the situation as a test. In contrast, low-anxious 
subjects are relatively unaffected by such a simple 
redefinition of the context. 

Tests with narrow time limits pose a special 
problem to persons with high levels of test anxiety. 
Time pressure seems to exacerbate the degree of 
personal threat, causing significant reductions in 
the performance of test-anxious persons. Siegman 
(1956) demonstrated this point many years ago by 
comparing performance levels of high- and low- 
anxious medical/psychiatric patients on timed and 
untimed subtests from the WAIS. The WAIS con- 
sists of eleven subtests, including six subtests for 
which the examiner uses a stopwatch to enforce 
strict time limits, and five subtests for which the 
subject has unlimited time to respond. Interest- 
ingly, the high- and low-anxious subjects were of 
equal overall ability on the WAIS. However, each 
group excelled on different kinds of subtests in pre- 
dictable directions. In particular, the low-anxious 
subjects surpassed the high-anxious subjects on 
timed subtests, whereas the reverse pattern was ob-
served on untimed subtests (Figure 2.4).

FIGURE 2.4 Influence of Timing and Anxiety Level on
WAIS Subtest Results

Source: Based on data from Siegman, A. W. (1956). The effect of
manifest anxiety on a concept formation task, a nondirected learn-
ing task, and on timed and untimed intelligence tests. Journal of
Consulting Psychology, 20, 176-178.


Motivation to Deceive


Test results may be inaccurate if the subject has
reasons to perform in an inadequate or unrepresen-
tative manner. Overt faking of test results is rare,
but it does happen. A small fraction of persons
seeking benefits from rehabilitation or social agen-
cies will consciously fake bad on personality and
ability tests. Occasionally, persons who anticipate 
criminal prosecution will fake mental illness on 
personality tests. Consider the case of the psy- 
chotherapy client who took a personality test at the 
behest of his therapist. The therapist desired an ac- 
curate assessment of the client’s seemingly mild 
depression. The results were ambiguous, indicating 
either a monumental degree of psychological dis- 
turbance or a conscious attempt to exaggerate 
symptoms. Two weeks later the therapist inadver- 
tently discovered that the client was about to be 
charged with child molestation. Apparently, he had 
faked the test results, anticipating that legal charges 
would soon be filed against him. He planned to de- 
fend himself, in part, by claiming that mental ill- 
ness was a mitigating factor in his behavior. 

In most cases, a well-trained psychometrician 
can detect conscious faking by asking two ques- 
tions: (1) Does the client have motivation to per- 
form deceitfully on the tests? (2) Is the overall 
pattern of test results suspicious in light of other in- 
formation known about the client? If the answer to 
both questions is “yes,” then the examiner is well 
advised to approach the test results with skepticism.


Effects of Coaching on Test Results


The influence of coaching on test scores has been 
widely studied by psychologists and educators. 
Coaching may include several components: extra 
practice on testlike materials, review of fundamen- 
tal concepts likely to be covered by the test, and ad- 
vice about optimal test-taking strategies. We can do 
no more than mention a few highlights here, with 
special emphasis on aptitude tests. Readers who 
wish more detail can consult Anastasi (1981) and 
Bond (1989). 

There is no doubt that coaching can improve 
scores significantly on certain kinds of aptitude tests 
containing “coachable” test item types. For exam- 
ple, Powers and Swinton (1984) mailed various sets 
of test preparation materials to random samples of 
Graduate Record Examination (GRE) candidates 
approximately five weeks before the test adminis-
tration.¹ The test preparation materials included
extra practice tests, explanations to practice test 
questions, and hints or strategies for answering dif- 
ferent item types. Incidentally, GRE test scores gen- 
erally range from 200 to 800 with a national average 
of approximately 500. At the time of this study, the 
general test included three sections: Verbal, Quanti- 
tative, and Analytical. By comparing the perfor- 
mance of the experimental subjects with control 
subjects who received no supplementary test prepa- 
ration materials, Powers and Swinton (1984) were 
able to deduce that four hours of special self-tutored 
preparation yielded a dividend of a 53-point increase 
on the GRE Analytical Test scale. This finding was 
comparable to a 66-point increase in an earlier in- 
structor-based intervention that entailed seven hours 
of direct contact (Swinton & Powers, 1983). 

However, the effects observed in these studies 
were highly specific. The special preparation ma- 
terials for analytical items acted only on the ana- 
lytical scores, not on verbal or quantitative scores. 
In fact, the effects were restricted only to the
particular kinds of analytical items that were
coachable—items involving analysis of explana-
tions and logical diagrams. Performance on other
kinds of analytical items—for example, analytical
reasoning—was unaffected.

1. The GRE is required by many graduate school admission
committees. During the period when Powers and Swinton
(1984) studied the GRE, it consisted of verbal, quantitative, and
analytical portions.

Coaching presents a serious problem to test de- 
velopers because it can inflate an examinee’s score 
without correspondingly improving his or her over- 
all abilities in the domain being tested. If this oc- 
curs, the score is no longer a valid representation of 
an examinee’s ability. Thus, coaching may invali- 
date existing norms. Certainly, coaching raises the 
specter of privileged test preparation for rich or 
savvy students at the expense of those who are poor 
or uninformed. After all, privately established 
coaching schools abound, but they may charge hun- 
dreds of dollars for a day or two of preparation for 
aptitude exams. 

Fortunately, there is a partial solution to the 
problem of coaching that many test developers are 
embracing: make self-tutored coaching available to 
everyone. Most major testing programs, including 
the Graduate Record Examinations, now provide 
sample test materials so that examinees can become 
familiar with the nature of the questions. Of course, 
this does not guarantee that everyone will make use 
of the materials, but at least the opportunity is im- 
mediately available at no cost. Students who refuse 
to inspect test familiarization materials do so at 
their own risk. 


ISSUES IN SCORING


Group tests generally employ a multiple-choice for- 
mat for which the examinee pencils in responses on 
a separate answer sheet that is then machine scored 
with total objectivity and accuracy. Consequently, 
group tests seldom present an opportunity for human 
error in scoring. However, scoring errors can creep 
in with machine-scored group tests if examinees do 
not sufficiently darken the response areas with a soft 
lead pencil or if they leave extraneous marks on the 
answer sheet. To guard against this source of error, 
testers must inspect every answer sheet and correct 
any irregularities in pencil markings before submit- 
ting materials for machine scoring. 


Scoring errors occur mainly with individual 
tests for which the examiner must make scoring 
judgments, add columns of numbers, and consult ta- 
bles that convert raw scores into IQs or other sum- 
mary statistics. Contrary to popular belief, examiner 
judgments about scoring—whether a certain re- 
sponse on an IQ test merits one or two raw score 
points, for example—are almost never a significant
source of scoring errors. We illustrate this point with 
individual IQ testing, but it applies to most other in- 
dividual tests as well. Even when examiners bla- 
tantly err in a conservative direction, consistently 
giving a subject less credit than deserved for am- 
biguous or borderline answers, the net effect on Full 
Scale IQ is minimal, perhaps one or two points at 
most. The reason that judgment errors seldom make 
a serious difference is simple: Scoring criteria on 
most tests are spelled out in such detail that the ex- 
aminer is seldom required to make a judgment call. 

Clerical scoring errors are another matter alto- 
gether. These kinds of errors occur far more often 
than even psychologists want to admit. As we shall 
see, clerical scoring errors can have disastrous 
effects. 

Ryan, Prifitera, and Powers (1983) asked 19 
psychologists and 20 graduate students to score the 
WAIS-R protocols of two vocational counseling 
clients. For one client whose Full Scale IQ was 110, 
the practicing psychologists tallied scores ranging 
from 107 to 115, and the graduate students obtained 
scores ranging from 108 to 117. The variations in 
scoring were due largely to clerical scoring errors, 
not judgmental differences about credit due for am- 
biguous or borderline answers. Gregory (1987) has 
illustrated much the same point with a group of ad- 
vanced graduate students who erred by as much as 
30 points when scoring a standard Wechsler IQ test 
protocol. We may surmise, then, that scoring errors 
do occur frequently and do seriously compromise 
the accuracy of IQ assessment and other forms of 
psychological testing. This problem does not dis- 
appear merely because of increased experience. 
The only way to avoid clerical scoring errors is to 
publicize how widespread the problem is and en- 
courage examiners to exercise great care when 
scoring protocols. 
SUMMARY


1. Standardized testing procedures are essen- 
tial to valid testing. The use of nonstandard proce- 
dures may alter the meaning of the test results, 
rendering them invalid and misleading. 


2. Flexibility in testing procedures is none- 
theless appropriate when it is reasoned and delib- 
erate. In determining whether a flexible shift in 
testing procedures is acceptable, the examiner 
should surmise how the test was most likely ad- 
ministered to the normative sample. 


3. In individual testing, it is desirable for the 
examiner to become highly familiar with the test 
materials. Tests need to be rehearsed so that the ex- 
aminer can anticipate the appropriate responses to 
the numerous contingencies of testing. 


4. Another important ingredient of valid test- 
ing is sensitivity to disabilities in the examinee. 
When disabilities go unrecognized, serious errors 
of test interpretation may occur; for example, a deaf 
examinee may be misdiagnosed as having a mental 
disability. 

5. In the administration of group tests, examiners
must adhere strictly to oral instructions and
defined time limits. In addition, the physical con- 
ditions of testing must be appropriate, such as 
proper lighting and minimal noise. 

6. Especially in the administration of individ- 
ual tests, examiners are urged to establish rapport. 
In testing, rapport is a comfortable, warm atmos- 
phere that serves to motivate examinees and elicit 
cooperation. 

7. Contrary to popular expectation, most
studies find that the sex, experience, and race of the
examiner have little effect upon psychological test
results. Nonetheless, there may be specialized
cases in which examiner-examinee interactions
produce a detrimental effect upon test scores.


8. Test anxiety refers to those phenomeno- 
logical, physiological, and behavioral responses 
that accompany concern about possible failure on 
a test. Test anxiety has been shown to correlate 
negatively with school achievement, aptitude test 
scores, measures of intelligence, and performance 
on timed tests. 


9. Faking of test results is rare, but does 
occur. In most cases, a well-trained examiner can 
detect conscious faking by asking whether the 
client has motivation to perform deceitfully on the 
tests and whether the overall pattern of test results 
is suspicious in light of other information. 


10. Coaching can improve examinee test 
scores for certain kinds of coachable items such as 
the analytical portion of the Graduate Record Ex- 
amination (GRE). Practice with sample test items 
and learning hints or strategies can boost a GRE 
subtest score by 50 to 60 points on the 800-point 
scale. 


11. In scoring individual tests, differences in
personal judgment (for example, assigning a certain
test response one raw score point versus two
points) rarely influence the overall test score to
any appreciable extent; the effect is one or two Full
Scale IQ points at most.


12. Clerical scoring errors, such as adding
columns of scores incorrectly or consulting the
wrong reference table, pose a serious problem in
individual psychological testing. These kinds of
errors can cause test scores to be wildly inaccurate.
Summary
Key Terms and Concepts 


This chapter concerns two basic concepts
needed to facilitate the examiner’s interpre- 
tation of test scores: norms and reliability. In most 
cases, scores on psychological tests are interpreted 
by reference to norms that are based upon the dis- 
tribution of scores obtained by a representative 
sample of examinees. In Topic 3A, Norms and Test 
Standardization, we review the process of stan- 
dardizing a test against an appropriate norm group 
so that test users can make sense out of individual 
test scores. Since the utility of a test score is also 
determined by the consistency or repeatability of 
test results, we introduce the essentials of reliabil- 
ity theory and measurement in Topic 3B, Concepts 
of Reliability. The next chapter flows logically 
from the material presented here and investigates
the complex issues of validity—does a test measure 
what it is supposed to measure? First, we begin 
with the more straightforward issues of establish- 
ing a comparative frame of reference (norms) and 
determining the consistency or repeatability of test 
results (reliability). 

The initial outcome of testing is typically a 
raw score such as the total number of personality 
statements endorsed in a particular direction or 
the total number of problems solved correctly, per- 
haps with bonus points added in for quick solu- 
tions. In most cases, the initial score is useless 
by itself. For test results to be meaningful, examiners
must be able to convert the initial score to
some form of derived score based upon comparison
to a standardization or norm group. The vast
majority of tests are interpreted by comparing 
individual results to a norm group performance; 
criterion-referenced tests are an exception, dis- 
cussed subsequently. 

A norm group consists of a sample of exami- 
nees who are representative of the population 
for whom the test is intended. Consider a word 
knowledge test designed for use with prospective 
college first-year students. In this case, the per- 
formance of a large, heterogeneous, nationwide 
sampling of such persons might be collected for 
purposes of standardization. The essential objective 
of test standardization is to determine the distribu- 
tion of raw scores in the norm group so that the test 
developer can publish derived scores known as
norms. Norms come in many varieties, for example, 
percentile ranks, age equivalents, grade equivalents, 
or standard scores, as discussed in the following. In 
general, norms indicate an examinee’s standing on 
the test relative to the performance of other persons 
of the same age, grade, sex, and so on. 

To be effective, norms must be obtained with 
great care and constructed according to well-known 
precepts discussed in the following. Furthermore, 
norms may become outmoded in just a few years, so 
periodic renorming of tests should be the rule, not the 
exception (Case Exhibit 3.1). We approach the topic 
of norms indirectly, first providing the reader with a 
discussion of raw scores and then reviewing statisti- 
cal concepts essential to an understanding of norms. 





CHAPTER 3 NORMS AND RELIABILITY


RAW SCORES


The most basic level of information provided by a 
psychological test is the raw score. For example, 
in personality testing, the raw score is often the 
number of questions answered in the keyed direc- 
tion for a specific scale. In ability testing, the raw 
score commonly consists of the number of prob- 
lems answered correctly, often with bonus points 
added for quick performance. Thus, the initial out- 
come of testing is almost always a numerical tally 
such as 17 out of 44 items answered in the keyed 
direction on a depression scale, or 29 of 55 raw 
score points earned on the block design subscale of 
an intelligence test. 

However, it should be obvious to the reader that 
raw scores, in isolation, are absolutely meaning- 
less. For example, what use is it to know that a sub- 
ject correctly solved 12 of 20 abstract reasoning 
questions? What does it mean that an examinee re- 
sponded in the keyed direction to 19 out of 33 true- 
false questions from a psychological-mindedness 
scale? 

It is difficult to even think about such questions 
without resorting to comparisons of one variety or 
another. We want to know how others have done on 
these tests, whether the observed scores are high or 
low in comparison to a representative group of sub- 
jects. In the case of ability tests, we are curious 
whether the questions were easy or hard, especially 
in relation to the age of the subject. 

In fact, it seems almost a truism that a raw score 
becomes meaningful mainly in relation to norms, 
an independently established frame of reference 
derived from a standardization sample. We have 
much to say about the derivation and use of norms 
later in this unit. For now it will suffice to know that 
norms are empirically established by administering 
a test to a large and representative sample of per- 
sons. An examinee’s score is then compared to the 
distribution of scores obtained by the standardiza- 
tion sample. In this manner, we determine from the 
norms whether an obtained score is low, average, or 
high. 

The vast majority of psychological tests are 
interpreted by consulting norms; as noted, these
instruments are called norm-referenced tests. However,
the reader is reminded that other kinds of
instruments do exist. In particular, criterion-refer- 
enced tests help determine whether a person can ac- 
complish an objectively defined criterion such as 
adding pairs of two-digit numbers with 97 percent 
accuracy. In the case of criterion-referenced tests, 
norms are not essential. We elaborate upon crite- 
rion-referenced tests at the end of this topic. 

There are many different kinds of norms, but 
they share one characteristic: Each incorporates a 
statistical summary of a large body of scores. Thus, 
in order to understand norms, the reader needs to 
master elementary descriptive statistics. We take a 
modest digression here to review essential statisti- 
cal concepts. 


ESSENTIAL STATISTICAL CONCEPTS


Suppose for the moment that we have access to a 
high-level vocabulary test appropriate for testing 
the verbal skills of college professors and other pro- 
fessional persons (Gregory & Gernert, 1990). The 
test is a multiple-choice quiz of 30 difficult words 
such as welkin, halcyon, and mellifluous. A curious 
professor takes the test and chooses the correct al- 
ternative for 17 of the 30 words. She asks how her 
score compares to others of similar academic stand- 
ing. How might we respond to her question? 

One manner of answering the query would be 
to give her a list of the raw scores from the prelim- 
inary standardization sample of 100 representative 
professors at her university (Table 3.1). However, 
even with this relatively small norm sample (thou- 
sands of subjects is more typical), the list of test 
scores is an overpowering display. 

When confronted with a collection of quantita- 
tive data, the natural human tendency is to summa- 
rize, condense, and organize it into meaningful 
patterns. For example, in assessing the meaning of 
the curious professor’s vocabulary score, the reader 
might calculate the average score for the entire 
sample, or tally the relative position of the profes- 
sor’s score (17 correct) among the 100 data points 
found in Table 3.1. We review these and other
approaches to organizing and summarizing quantitative
data in the following sections.


TABLE 3.1 Raw Scores of 100 Professors
on a 30-Item Vocabulary Test
Source: Based on data from Gregory, R. J., & Gernert, C. H.
(1990). Age trends for fluid and crystallized intelligence in an able
subpopulation. Unpublished manuscript.


Frequency Distributions 


A very simple and useful way of summarizing data 
is to tabulate a frequency distribution (Table 3.2). A 
frequency distribution is prepared by specifying a 
small number of usually equal-sized class intervals 
and then tallying how many scores fall within each 
interval. The sums of the frequencies for all intervals 
will equal N, the total number of scores in the sam- 
ple. There is no hard and fast rule for determining 
the size of the intervals. Obviously, the size of the in- 
tervals depends upon the number of intervals de- 
sired. It is common for frequency distributions to 
include between 5 and 15 class intervals. In the case 
of Table 3.2, there are 9 class intervals of 3 scores 
each. The table indicates that one professor scored 4, 
5, or 6, eight professors scored 7, 8, or 9, and so on. 
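
The tallying procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the `scores` list is made up for the example (the raw data of Table 3.1 are not reproduced here), and the function name `frequency_distribution` and its parameters are assumptions introduced for the sketch.

```python
from collections import Counter

def frequency_distribution(scores, low, width, n_intervals):
    """Tally scores into equal-sized class intervals.

    Returns a list of ((lower, upper), frequency) pairs, where each
    interval covers the integer scores lower..upper inclusive.
    """
    counts = Counter()
    for score in scores:
        index = (score - low) // width      # which interval the score falls in
        if 0 <= index < n_intervals:
            counts[index] += 1
    return [
        ((low + i * width, low + (i + 1) * width - 1), counts[i])
        for i in range(n_intervals)
    ]

# Hypothetical raw scores (illustration only, not the data of Table 3.1).
scores = [5, 8, 9, 11, 12, 14, 14, 15, 16, 17, 17, 18, 19, 20, 21, 23, 26, 29]

# Nine class intervals of three scores each, starting at 4, as in Table 3.2.
table = frequency_distribution(scores, low=4, width=3, n_intervals=9)
for (lower, upper), freq in table:
    print(f"{lower}-{upper}: {freq}")

# The frequencies sum to N, the total number of scores.
total = sum(freq for _, freq in table)
```

Changing `width` and `n_intervals` shows how the choice of interval size trades detail against compactness.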

A histogram provides a graphic representation 
of the same information contained in the frequency 
distribution (Figure 3.1a). The horizontal axis por- 
trays the scores grouped into class intervals, whereas 
the vertical axis depicts the number of scores falling 
within each class interval. In a histogram, the height 
of a column indicates the number of scores occurring 
within that interval. A frequency polygon is similar
to a histogram, except that the frequency of the class
intervals is represented by single points rather than
columns. The single points are then joined by
straight lines (Figure 3.1b).


TABLE 3.2 Frequency Distribution of Scores
of 100 Professors on a Vocabulary Test

Class Interval    Frequency
4-6                 1
7-9                 8
10-12              12
13-15              21
16-18              24
19-21              21
22-24               7
25-27               5
28-30               1

N = 100
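
As a rough illustration, the frequencies reported in Table 3.2 can be drawn as a text histogram. The sketch below is an assumption-laden stand-in for Figure 3.1a, not a reproduction of it: the bars run horizontally rather than vertically, and the name `text_histogram` is hypothetical.

```python
# Class intervals and frequencies as given in Table 3.2.
table_3_2 = [
    ("4-6", 1), ("7-9", 8), ("10-12", 12), ("13-15", 21), ("16-18", 24),
    ("19-21", 21), ("22-24", 7), ("25-27", 5), ("28-30", 1),
]

def text_histogram(table):
    """Return one line per class interval; bar length equals the frequency."""
    return [f"{label:>5} | {'#' * freq} ({freq})" for label, freq in table]

lines = text_histogram(table_3_2)
print("\n".join(lines))
```

The longest bar falls on the 16-18 interval, which is where the bulk of the scores cluster.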

The graphs shown in Figure 3.1 constitute vi- 
sual summaries of the 100 raw score data points 
from the sample of professors. In addition to visual 
summaries of data, it is also possible to produce 
numerical summaries by computing statistical in- 
dices of central tendency and dispersion. 


Measures of Central Tendency 


Can we designate a single, representative score for 
the 100 vocabulary scores in our sample? The mean 
(M), or arithmetic average, is one such measure of 
central tendency. We compute the mean by adding 
all the scores up and dividing by N, the number of 
scores. Another useful index of central tendency is 
the median, the middlemost score when all the 
scores have been ranked. If the number of scores is 
even, the median is the average of the middlemost 
two scores. In either case, the median is the point 
that bisects the distribution so that half of the cases 
fall above it, half below. Finally, the mode is sim- 
ply the most frequently occurring score. If two 
scores tie for highest frequency of occurrence, the 
distribution is said to be bimodal. 
FIGURE 3.1 (a) A Histogram Representing Vocabulary Test Scores for 100 Professors. (b) A Frequency Polygon of Vocabulary Test Scores for 100 Professors.


The mean of the scores listed in Table 3.1 is 
16.8; the median and mode are both 17. In this in- 
stance, the three measures of central tendency are 
in very good agreement. However, this is not al- 
ways so. The mean is sensitive to extreme values 
and can be misleading if a distribution has a few 
scores that are unusually high or low. Consider an 
extreme case in which nine persons earn $10,000 
and a tenth person earns $910,000. The mean in- 
come for this group is $100,000, yet this income 
level is not typical of anyone in the group. The me- 
dian income of $10,000 is much more representa- 
tive. Of course, this is an extreme example, but it 
illustrates a general point: If a distribution of scores 
is skewed (that is, asymmetrical), the median is a 
better index of central tendency than the mean. 
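
The income example can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for illustration.

```python
import statistics

# Nine persons earn $10,000 and a tenth earns $910,000.
incomes = [10_000] * 9 + [910_000]

mean_income = statistics.mean(incomes)      # pulled upward by the one extreme value
median_income = statistics.median(incomes)  # middle of the ranked incomes
mode_income = statistics.mode(incomes)      # most frequently occurring income

print(mean_income, median_income, mode_income)
```

The mean lands at $100,000 while the median and mode stay at $10,000, which is why the median is the safer summary for a skewed distribution.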


Measures of Variability 


Two or more distributions of test scores may have 
the same mean, yet differ greatly in the extent of 
dispersion of the scores about the mean (Figure 
3.2). To describe the degree of dispersion, we need 
a statistical index that expresses the variability of 
scores in the distribution. 

The most commonly used statistical index of 
variability in a group of scores is the standard deviation,
designated as s or abbreviated as SD. From
a conceptual standpoint, the reader needs to know 
that the standard deviation reflects the degree of 
dispersion in a group of scores. If the scores are 
tightly packed around a central value, the standard 
deviation is small. In fact, in the extreme case in
which all the scores are identical, the standard deviation
is exactly zero. As a group of scores becomes
more spread out, the standard deviation
becomes larger. For example, in Figure 3.2, distribution
a would have the largest standard deviation,
distribution c the smallest.


FIGURE 3.2 Three Distributions with Identical Means
but Different Variability

The standard deviation, or s, is simply the
square root of the variance, designated as $s^2$. The
formula for the variance is


$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$


where $\sum$ designates “the sum of,” X stands for each
individual score, $\bar{X}$ is the mean of the scores, and N
is the total number of scores. As the name suggests,
the variance is a measure of variability. However, 
psychologists usually prefer to report the standard 
deviation, which is computed by taking the square 
root of the variance. Of course, the variance and the 
standard deviation convey interchangeable infor- 
mation—one can be computed from the other by 
squaring (the standard deviation to obtain the vari- 
ance) or taking the square root (of the variance to 
obtain the standard deviation). The standard deviation
is nonetheless the preferred measure of variability
in psychological testing because of its direct
relevance to the normal distribution, as discussed 
in the next section. 
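
The computation can be sketched in Python as follows: sum the squared deviations from the mean, divide by N - 1, and take the square root to get the standard deviation. The scores are hypothetical and the function name `sample_variance` is an assumption made for this illustration.

```python
import math
import statistics

def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

# Hypothetical scores, chosen only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(scores)
sd = math.sqrt(variance)   # the standard deviation is the square root of the variance

# The standard library functions use the same N - 1 definition.
print(variance, statistics.variance(scores))
print(sd, statistics.stdev(scores))
```

Squaring `sd` recovers `variance`, which is the sense in which the two indices carry interchangeable information.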


The Normal Distribution 


The frequency polygon depicted in Figure 3.1b is 
highly irregular in shape, a typical finding with 
real-world data based upon small sample sizes. 
What would happen to the shape of the frequency 
polygon if we increased the size of the normative 
sample and also increased the number of class in- 
tervals by reducing their size? Possibly, as we 
added new subjects to our sample, the distribution 
of scores would more and more closely resemble a 
symmetrical, mathematically defined, bell-shaped 
curve called the normal distribution (Figure 3.3). 

Psychologists prefer a normal distribution of
test scores, even though many other distributions are
theoretically possible. For example, a rectangular
distribution of test scores (an equal number of outcomes
in each class interval) is within the realm of
possibility. Indeed, many laypersons might even
prefer a rectangular distribution of test scores on the
egalitarian premise that individual differences are
thereby less pronounced. For example, a higher proportion
of persons would score in the superior range
if psychological tests conformed to a rectangular
rather than normal distribution of scores.


FIGURE 3.3 The Normal Curve and the Percentage
of Cases within Certain Intervals

Why, then, do psychologists prefer a normal 
distribution of test scores, even to the point of se- 
lecting test items that help produce this kind of dis- 
tribution in the standardization sample? There are 
several reasons, including statistical considerations 
and empirical findings. We digress briefly here to 
explain the psychometric fascination with normal 
distributions. 

One reason that psychologists prefer normal dis- 
tributions is that the normal curve has useful math- 
ematical features that form the basis for several 
kinds of statistical investigation. For example, sup- 
pose we wished to determine whether the average 
IQs for two groups of subjects were significantly dif- 
ferent. An inferential statistic such as the t-test for a 
difference between means would be appropriate. 
However, many inferential statistics are based upon 
the assumption that the underlying population of 
scores is normally distributed, or nearly so. Thus, in 
order to facilitate the use of inferential statistics, psy- 
chologists prefer that test scores in the general pop- 
ulation follow a normal or near-normal distribution. 
Another basis for preferring the normal distribution
is its mathematical precision. Since the
normal distribution is precisely defined in mathe- 
matical terms, it is possible to compute the area un- 
derneath different regions of the curve with great 
accuracy. Thus, a useful property of normal distri- 
butions is that the percentage of cases falling within 
a certain range or beyond a certain value is precisely 
known. For example, in a normal distribution, a mere 
2.28 percent of the scores will exceed the mean by
two standard deviations or more (Figure 3.3). In like 
manner, we can determine that the vast bulk of 
scores—more than 68 percent—fall within one stan- 
dard deviation of the mean in either direction. 
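
These areas can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are chosen for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_one_sd = z.cdf(1) - z.cdf(-1)          # within one SD of the mean, either direction
beyond_two_sd = 1 - z.cdf(2)                  # two or more SDs above the mean
between_two_and_three = z.cdf(3) - z.cdf(2)   # band from +2 SD to +3 SD

print(f"within 1 SD:       {within_one_sd:.2%}")
print(f"beyond +2 SD:      {beyond_two_sd:.2%}")
print(f"between +2 and +3: {between_two_and_three:.2%}")
```

The band between two and three standard deviations above the mean holds about 2.14 percent of cases, whereas the whole tail beyond two standard deviations holds about 2.28 percent, so the two figures should not be confused.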

A third basis for preferring a normal distribu- 
tion of test scores is that the normal curve often 
arises spontaneously in nature. In fact, early inves- 
tigators were so impressed with the ubiquity of the 
normal distribution that they virtually deified the 
normal curve as a law of nature. For example, Gal- 
ton (1888) wrote: 


It is the supreme law of Unreason. Whenever a 
large sample of chaotic elements are taken in hand 
and marshalled in the order of their magnitude, an 
unsuspected and most beautiful form of regularity 
proves to have been latent all along. 


Certainly there is no “law of nature” regarding 
the form that frequency distributions must take. 
Nonetheless, it is true that many important human 
characteristics—both physical and mental—pro- 
duce a close approximation to the normal curve 
when measurements for large and heterogeneous 
samples are graphed. For example, a near-normal 
distribution curve is a well-known finding for phys- 
ical characteristics such as birthweight, height, and 
brain weight (Jensen, 1980). 

An approximately normal distribution is also 
found with numerous mental tests, even for tests 
constructed entirely without reference to the nor- 
mal curve. To illustrate this point, we refer to early 
tests devised before the current psychometric fixa- 
tion upon the normal distribution. Wechsler (1944) 
chose items for the original Wechsler-Bellevue In- 
telligence Scale largely on the basis of variety of 
item types, paying no heed to the resulting distri- 
bution of scores. In fact, he considered the belief 
that mental measures must distribute themselves 
according to the normal curve to be “mistaken.” 
Yet, when he graphed the distribution of Full Scale 
IQs on his test, the predictably near-normal distri- 
bution emerged (Figure 3.4). Lindvall (1967) found 
the same thing when plotting data from the 1923 
Pintner Ability Test. We see, then, that even in the
absence of psychometric tinkering, the distribution 
of mental test scores in standardization samples 
typically approximates a normal curve. 


Skewness 


Skewness refers to the symmetry or asymmetry of 
a frequency distribution. If test scores are piled up 
at the low end of the scale, the distribution is said 
to be positively skewed. In the opposite case, when 
test scores are piled up at the high end of the scale, 
the distribution is said to be negatively skewed 
(Figure 3.5). 
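
The direction of skew can also be expressed numerically. The sketch below uses one common index, the third standardized moment, which is not discussed in the text and is brought in here only for illustration; the scores are hypothetical.

```python
def skewness(scores):
    """Third standardized moment: positive when the long tail points to high scores."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical scores on a 10-point test (illustration only).
piled_low = [0, 0, 1, 1, 1, 2, 2, 3, 5, 9]     # most scores near the bottom
piled_high = [10 - x for x in piled_low]       # mirror image: most scores near the top

print(skewness(piled_low))    # positive value: positively skewed
print(skewness(piled_high))   # negative value: negatively skewed
```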

In psychological testing, skewed distributions 
usually signify that the test developer has included 
too few easy items or too few hard items. For ex- 
ample, when scores in the standardization sample 
are massed at the low end (positive skew), the test
probably contains too few easy items to make ef- 
fective discriminations at this end of the scale. In 
this case, examinees who obtain zero or near-zero 
scores might actually differ with respect to the di- 
mension measured. However, the test is unable to 
elicit these differences, since most of the items are 
too hard for these examinees. Of course, the oppo- 
site pattern holds as well. If scores are massed at the 
high end (negative skew), the test probably contains 
too few hard items to make effective discrimina- 
tions at this end of the scale. 

When initial research indicates that an instru- 
ment produces skewed results in the standardiza- 
tion sample, test developers typically revamp the 
test at the item level. The most straightforward so- 
lution is to add items or modify existing items so 
that the test has more easy items (to reduce positive 
skew) or more hard items (to reduce negative 
skew). If it is too late to revise the instrument, the 
test developer can use a statistical transformation 
to help produce a more normal distribution of 
scores (see the following). However, the preferred 
strategy is to revise the test so that skewness is min- 
imal or nonexistent. 


RAW SCORE TRANSFORMATIONS


Making sense out of test results is largely a matter 
of transforming the raw scores into more inter- 
pretable and useful forms of information. In the 
preceding discussion of normal distributions, we 
hinted at transformations by showing how knowl- 
edge of the mean and standard deviation of such 
distributions can help us determine the relative 
standing of an individual score. In this section we 
continue this theme in a more direct manner by in- 
troducing the formal requirements for several kinds 
of raw score transformations. 


Percentiles and Percentile Ranks 


A percentile expresses the percentage of persons in 
the standardization sample who scored below a 
specific raw score. For example, on the vocabulary 
test depicted in Table 3.2, 94 percent of the sample
fell below a raw score of 25. Thus, a raw score of
25 would correspond to a percentile of 94, denoted
as $P_{94}$. Note that higher percentiles indicate higher
scores. In the extreme case, an examinee who obtained
a raw score that exceeded every score in the
standardization sample would receive a percentile
of 100, or $P_{100}$.
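
The percentile for a raw score of 25 can be recovered from the frequencies in Table 3.2. This is a minimal sketch; the function name `percent_below` is an assumption, and because the table groups scores into class intervals the result is exact only for raw scores at the start of an interval.

```python
# Class intervals (lower, upper) and frequencies as given in Table 3.2.
table_3_2 = [
    ((4, 6), 1), ((7, 9), 8), ((10, 12), 12), ((13, 15), 21), ((16, 18), 24),
    ((19, 21), 21), ((22, 24), 7), ((25, 27), 5), ((28, 30), 1),
]

def percent_below(raw_score, table):
    """Percentage of the sample in class intervals lying entirely below raw_score."""
    n = sum(freq for _, freq in table)
    below = sum(freq for (_, upper), freq in table if upper < raw_score)
    return 100 * below / n

p = percent_below(25, table_3_2)
print(p)
```

The same function shows that a raw score above every score in the sample corresponds to a percentile of 100.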

The reader is warned not to confuse percentiles 
with percent correct. Remember that a percentile 
indicates only how an examinee compares to the 
standardization sample and does not convey the 
percentage of questions answered correctly. Con- 
ceivably, on a difficult test, a raw score of 50 per- 
cent correct might translate to a percentile of 90, 
95, or even 100. Conversely, on an easy test, a raw 
score of 95 percent correct might translate to a per- 
centile of 5, 10, or 20. 

Percentiles can also be viewed as ranks in a 
group of 100 representative subjects, with 1 being
the lowest rank and 100 the highest. Note that per- 
centile ranks are the complete reverse of usual 
ranking procedures. A percentile rank (PR) of 1 is 
at the bottom of the sample, while a PR of 99 is near 
the top. 

A percentile of 50 ($P_{50}$) corresponds to the median
or middlemost raw score. A percentile of 25
($P_{25}$) is often denoted as Q1 or the first quartile
because one-quarter of the scores fall below this
point. In like manner, a percentile of 75 ($P_{75}$) is
referred to as Q3 or the third quartile because
three-quarters of the scores fall below this point.

Percentiles are easy to compute and intuitively 
appealing to laypersons and professionals alike. It 
is not surprising, then, that percentiles are the most 
common type of raw score transformation encoun- 
tered in psychological testing. Almost any kind of 
test result can be reported as a percentile, even 
when other transformations are the primary goal of 
testing. For example, intelligence tests are used to 
obtain IQ scores (a kind of transformation discussed
subsequently) but also yield percentile
scores. Thus, an IQ of 130 corresponds to a
percentile of 98, meaning that the score is not only 
well above average but, more precisely, exceeds 98 
percent of the standardization sample. 


Percentile scores do have one major drawback: 
They distort the underlying measurement scale, es- 
pecially at the extremes. A specific example will 
serve to clarify this point. Consider a hypothetical 
instance in which four persons obtain the following 
percentiles on a test: 50, 59, 90, and 99. (Remem- 
ber that we are speaking here of percentiles, not 
percent correct.) The first two persons differ by 9 
percentile points (50 versus 59) and so do the last 
two persons (90 versus 99). The untrained observer 
might assume, falsely, that the first two persons dif- 
fered in underlying raw score points by the same 
amount as the last two persons. An inspection of 
Figure 3.6 reveals the fallacy of this assumption. 
The difference in underlying raw score points be- 
tween percentiles of 90 and 99 is far greater than 
between percentiles of 50 and 59. 
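
The size of this distortion can be estimated under the assumption, made only for this sketch, that the underlying scores are normally distributed. The code converts each percentile to standard deviation units with `statistics.NormalDist.inv_cdf` and compares the two gaps.

```python
from statistics import NormalDist

z = NormalDist()  # assumption: scores follow a standard normal distribution

percentiles = [50, 59, 90, 99]
z_scores = [z.inv_cdf(p / 100) for p in percentiles]

gap_middle = z_scores[1] - z_scores[0]    # percentile 50 versus 59
gap_extreme = z_scores[3] - z_scores[2]   # percentile 90 versus 99

print(f"50 -> 59: {gap_middle:.2f} SD units")
print(f"90 -> 99: {gap_extreme:.2f} SD units")
```

Under this assumption the same 9-point percentile difference corresponds to a gap more than four times as large near the top of the distribution as near the middle.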


Standard Scores 


Although percentiles are the most popular type of 
transformed score, standard scores exemplify the 
most desirable psychometric properties. A standard 
score uses the standard deviation of the total distri- 
bution of raw scores as the fundamental unit of 
measurement. The standard score expresses the 
distance from the mean in standard deviation units. 
For example, a raw score that is exactly one standard
deviation above the mean converts to a standard
score of +1.00. A raw score that is exactly one-half
a standard deviation below the mean converts
to a standard score of -0.50. Thus, a standard score
not only expresses the magnitude of deviation from 
the mean, but the direction of departure (positive or 
negative) as well. 

Computation of an examinee’s standard score 
(also called a z score) is simple: Subtract the mean 
of the normative group from the examinee’s raw 
score and then divide this difference by the standard 
deviation of the normative group. Table 3.3 illus- 
trates the computation of z scores for three subjects 
of widely varying ability on a hypothetical test. 
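
The computation can be sketched as a one-line function. The normative mean of 50, the standard deviation of 8, and the three raw scores are those given in Table 3.3; the function name `z_score` is an assumption made for the illustration.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the normative mean, in standard deviation units."""
    return (raw - mean) / sd

# Normative values and raw scores from Table 3.3.
M, SD = 50, 8
for person, raw in [("A", 35), ("B", 50), ("C", 70)]:
    print(f"Person {person}: z = {z_score(raw, M, SD):+.2f}")
```

The sign of the result carries the direction of the departure from the mean, and its size carries the magnitude.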

Standard scores possess the desirable psychometric
property of retaining the relative magnitudes
of distances between successive values found in the 
original raw scores. This is because the distribution 
of standard scores has exactly the same shape as the 
distribution of raw scores. As a consequence, the 
use of standard scores does not distort the underly- 
ing measurement scale. This fidelity of the trans- 
formed measurement scale is a major advantage of 
standard scores over percentiles and percentile 
ranks. As previously noted, percentile scores are 
very distorting, especially at the extremes. 

TABLE 3.3 Computation of Standard Scores on a
Hypothetical Test

For the normative sample: M = 50, SD = 8

Standard Score = z = (X - M)/SD

Person A: raw score of 35 (below average)
z = (35 - 50)/8 = -1.88

Person B: raw score of 50 (exactly average)
z = (50 - 50)/8 = 0.00

Person C: raw score of 70 (above average)
z = (70 - 50)/8 = +2.50


A specific example will serve to illustrate the
nondistorting feature of standard scores. Consider
four raw scores of 55, 60, 70, and 80 on a test with
mean of 50 and standard deviation of 10. The first
two scores differ by 5 raw score points, while the
last two scores differ by 10 raw score points, twice
the difference of the first pair. When the raw scores
are converted to standard scores, the results are
+0.50, +1.00, +2.00, and +3.00, respectively. The
reader will notice that the first two scores differ by
0.50 standard score units, while the last two scores
differ by 1.00 standard score units, twice the difference of
the first pair. Thus, standard scores always retain
the relative magnitude of differences found in the
original raw scores.

Standard score distributions possess important 
mathematical properties that do not exist in the raw 
score distributions. When each of the raw scores in 
a distribution is transformed to a standard score, the 
resulting collection of standard scores always has a 
mean of zero and a variance of 1.00. Because the 
standard deviation is the square root of the vari- 
ance, the standard deviation of standard scores 
(√1.00) is necessarily 1.00 as well.
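Readers who wish to verify these properties can do so numerically. The sketch below uses six made-up raw scores (an assumption for illustration only) and Python's standard statistics module. It computes the standard deviation by dividing by N; the same result holds with the N - 1 formula, provided one formula is used consistently in both steps.

```python
from statistics import fmean, pvariance

raw_scores = [35, 42, 50, 50, 58, 70]  # made-up raw scores for illustration

mean = fmean(raw_scores)
sd = pvariance(raw_scores) ** 0.5  # SD of the whole group (divide by N)

# Transform every raw score to a standard score
z_scores = [(x - mean) / sd for x in raw_scores]

print("mean of z scores:    ", round(abs(fmean(z_scores)), 10))
print("variance of z scores:", round(pvariance(z_scores), 10))
```

Apart from tiny floating-point error, the mean comes out as zero and the variance as one, whatever raw scores are supplied (so long as they are not all identical).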

One reason for transforming raw scores into 
standard scores is to depict results on different tests 
according to a common scale. If two distributions 
of test scores possess the same form, we can make 
direct comparisons on raw scores by transforming 
them to standard scores. Suppose, for example, that 
a first-year college student earned 125 raw score 
points on a spatial thinking test for which the nor- 
mative sample averaged 100 points (with SD of 15 
points). Suppose, in addition, he earned 110 raw 
score points on a vocabulary test for which the nor- 
mative sample averaged 90 points (with SD of 20 
points). In which skill area does he show greater 
aptitude, spatial thinking or vocabulary? 

If the normative samples for both tests pro- 
duced test score distributions of the same form, 
we can compare spatial thinking and vocabulary 
scores by converting each to standard scores. The 
spatial thinking standard score for our student is 
(125 - 100)/15 or +1.67, whereas his vocabulary
standard score is (110 - 90)/20 or +1.00. Relative
to the normative samples, the student has greater 
aptitude for spatial thinking than vocabulary. 
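The comparison can be sketched in Python as follows. The dictionary layout and variable names are our own assumptions; the numbers are those of the example, and the comparison is meaningful only under the stated condition that both normative distributions have the same form.

```python
# (raw score, normative mean, normative SD) for each test, from the example
tests = {
    "spatial thinking": (125, 100, 15),
    "vocabulary": (110, 90, 20),
}

# Express each raw score on the common z-score scale
z_scores = {name: (raw - mean) / sd for name, (raw, mean, sd) in tests.items()}
strongest = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
print("Relatively stronger area:", strongest)
```

Note that the raw scores alone (125 versus 110) cannot answer the question, because the two tests have different means and standard deviations.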

But a word of caution is appropriate when comparing
standard scores from different distributions.
If the distributions do not have the same form, standard
score comparisons can be very misleading. We
illustrate this point with Figure 3.7, which depicts 
two distributions: one markedly skewed with aver- 
age score of 30 (SD of 10) and another normally 
distributed with average score of 60 (SD of 8). A 
raw score of 40 on the first test and a raw score of 
68 on the second test both translate to identical 
standard scores of +1.00. Yet, a standard score of 
1.00 on the first test exceeds 92 percent of the nor- 
mative sample, while the equivalent standard score 
on the second test exceeds only 84 percent of the 
normative sample. When two distributions of test 
scores do not possess the same form, equivalent 
standard scores do not signify comparable posi- 
tions within the respective normative samples. 


T Scores and Other Standardized Scores 


Many psychologists and educators appreciate the 
psychometric properties of standard scores but re- 
gard the decimal fractions and positive/negative 
signs (e.g., z = -2.32) as unnecessary distractions.
In response to these concerns, test specialists have
devised a number of variations on standard scores that
are collectively referred to as standardized scores. 

From a conceptual standpoint, standardized 
scores are identical to standard scores. Both kinds 
of scores contain exactly the same information. The 
shape of the distribution of scores is not affected, 
and a plot of the relationship between standard and 
standardized scores is always a straight line. How- 
ever, standardized scores are always expressed as 
positive whole numbers (no decimal fractions or 
negative signs), so many test users prefer to depict 
test results in this form. 

Standardized scores eliminate fractions and 
negative signs by producing values other than zero 
for the mean and 1.00 for the standard deviation of 
the transformed scores. The mean of the trans- 
formed scores can be set at any convenient value, 
such as 100 or 500, and the standard deviation at, 
say, 15 or 100. The important point about stan- 
dardized scores is that we can transform any distri- 
bution to a preferred scale with predetermined 
mean and standard deviation. 

One popular kind of standardized score is the T 
score, which has a mean of 50 and a standard deviation
of 10. T score scales are especially common
with personality tests. For example, on the MMPI, 
each clinical scale (e.g., Depression, Paranoia) is 
converted to a common metric for which 50 is the 
average score and 10 is the standard deviation for 
the normative sample. 

To transform raw scores to T scores, we use the 
following formula: 


T = 10(X - M)/SD + 50


The term (X - M)/SD is, of course, equivalent to z,
so we can rewrite the formula for T as a simple
transformation of z:


T = 10z + 50


For any distribution of raw scores, the correspond- 
ing T scores will have an average of 50. In addition, 
for most distributions the vast majority of T scores 
will fall between values of 20 and 80, that is, within 
three standard deviations of the mean. Of course, T 
scores outside this range are entirely possible and 
perhaps even likely in special populations. In clin- 
ical settings it is not unusual to observe very high 
T scores—even as high as 90—on personality in- 
ventories such as the MMPI. 
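A brief Python sketch of the T score transformation follows. The function names t_score and z_from_t are hypothetical, and the raw-score example borrows the normative values M = 50 and SD = 8 from the hypothetical test in Table 3.3.

```python
def t_score(raw, mean, sd):
    """Convert a raw score to a T score (mean 50, SD 10)."""
    z = (raw - mean) / sd
    return 10 * z + 50


def z_from_t(t):
    """Recover the z score that a given T score represents."""
    return (t - 50) / 10


print(t_score(70, 50, 8))  # a raw score 2.5 SDs above the mean
for t in (20, 80, 90):
    print(f"T = {t} corresponds to z = {z_from_t(t):+.1f}")
```

Reversing the formula makes the range remarks concrete: T scores of 20 and 80 sit three standard deviations from the mean, and a T score of 90 sits four standard deviations above it.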

Standardized scores can be tailored to produce 
any mean and standard deviation. However, to 
eliminate negative standardized scores, the prese- 
lected mean should be at least five times as large as 
the standard deviation. In practice, test developers 
rely upon a few preferred values for means and 
standard deviations of standardized scores, as out- 
lined in Table 3.4. 
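The general transformation is a one-line rescaling of z. In the sketch below the function names are our own assumptions, and the mean and standard deviation pairs are those listed in Table 3.4. The second function reports how far below the mean a score must fall before the rescaled value would reach zero.

```python
# (mean, SD) pairs taken from Table 3.4
SCALES = {
    "Full Scale IQ": (100, 15),
    "IQ subscale": (10, 3),
    "T score": (50, 10),
    "Aptitude test": (500, 100),
}


def standardized_score(z, mean, sd):
    """Linearly rescale a z score onto a scale with the chosen mean and SD."""
    return mean + sd * z


def z_at_zero(mean, sd):
    """The z score at which the rescaled value reaches zero."""
    return -mean / sd


for name, (mean, sd) in SCALES.items():
    one_sd_above = standardized_score(1.0, mean, sd)
    print(f"{name}: z = +1.00 becomes {one_sd_above}; "
          f"zero is reached at z = {z_at_zero(mean, sd):.2f}")
```

When the mean is exactly five times the standard deviation, as with T scores, a negative value would require a z score below -5.00, which is why that ratio is sufficient for practical purposes.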


Normalizing Standard Scores


As previously noted, psychologists and educators 
prefer to deal with normal distributions because 
the statistical properties of the normal curve are 
well known and standard scores from these distri- 
butions can be directly compared. Perhaps the 
reader has wondered what recourse is available to 
test developers who find that their tests produce an 
asymmetrical distribution of scores in the nor- 
mative sample. Fortunately, distributions of scores 
that are skewed or otherwise nonnormal can be 
transformed or normalized to fit a normal curve. 
Although test specialists have devised several 
methods for transmuting a nonnormal distribution 
into a normal one, we will discuss only the most 
popular approach—the conversion of percentiles to 
normalized standard scores. Oddly enough, it is 
easier to explain this approach if we first describe 
the reverse process: conversion of standard scores 
to percentiles. 

We have noted that a normal distribution of raw 
scores has, by definition, a distinct, mathematically 
defined shape (Figure 3.3). In addition, we have 
pointed out that transforming a group of raw scores 
to standard scores leaves the original form of a dis- 
tribution unchanged. Thus, if a collection of raw 
scores is normally distributed, the resulting stan- 
dard scores will obey the normal curve, too. 

TABLE 3.4 Means and Standard Deviations of Common Standardized Scores

Type of Measure           Specific Examples              Mean   Standard Deviation
Full Scale IQ             WAIS-III                        100    15
IQ Test Subscales         Vocabulary, Block Design         10     3
Personality Test Scales   MMPI-2 Depression, Paranoia      50    10
Aptitude Tests            Graduate Record Exam,           500   100
                          Scholastic Assessment Tests


We also know that the mathematical properties
of the normal distribution are precisely calculable.
Without going into the details of computation, it
should be obvious that we can determine the percentage
of cases falling below any particular standard
score. For example, in Figure 3.6, a standard
score of -2.00 (designated as -2σ) exceeds about 2.28
percent of the cases. Thus, a standard score of -2.00
corresponds to a percentile of about 2.28. In like manner,
any conceivable standard score can be expressed in
terms of its corresponding percentile. Appendix D
lists percentiles for standard scores and several
other transformed scores.
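For readers without a printed table at hand, the conversion can be sketched with the NormalDist class in Python's standard statistics module. The function name percentile_from_z is our own assumption, and the result is valid only for normally distributed scores.

```python
from statistics import NormalDist


def percentile_from_z(z):
    """Percentage of a normal distribution falling below a given z score."""
    return 100 * NormalDist().cdf(z)  # NormalDist() defaults to mean 0, SD 1


for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.2f}  percentile = {percentile_from_z(z):.2f}")
```

For z = -2.00 the function returns about 2.28, which is the 2.14 percent of cases lying between -3σ and -2σ plus the small tail below -3σ.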

Producing a normalized standard score is ac- 
complished by working in the other direction. 
Namely, we use the percentile for each raw score to 
determine its corresponding standard score. If we 
do this for each and every case in a nonnormal dis- 
tribution, the resulting distribution of standard 
scores will be normally distributed. Notice that in 
such a normalized standard score distribution, the 
standard scores are not calculated directly from the 
usual computational formula, but are determined 
indirectly by first computing the percentile and 
then ascertaining the equivalent standard score. 
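The two-step procedure can be sketched as follows. This is an illustration under stated assumptions, not a production norming routine: the function name is hypothetical, the ten raw scores are made up, and percentile ranks are computed with a mid-rank convention (half of the tied cases counted as below) so that no score receives a proportion of exactly 0 or 1, for which no normal-curve equivalent exists.

```python
from statistics import NormalDist


def normalized_z_scores(raw_scores):
    """Convert raw scores to normalized standard scores via percentile ranks."""
    n = len(raw_scores)
    standard_normal = NormalDist()
    result = []
    for x in raw_scores:
        below = sum(1 for y in raw_scores if y < x)
        ties = sum(1 for y in raw_scores if y == x)  # includes x itself
        # Mid-rank proportion (assumption): always strictly between 0 and 1
        proportion = (below + 0.5 * ties) / n
        # Step 2: find the z score with that proportion of the curve below it
        result.append(standard_normal.inv_cdf(proportion))
    return result


raw = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]  # made-up, positively skewed scores
for x, z in zip(raw, normalized_z_scores(raw)):
    print(f"raw = {x:2d}  normalized z = {z:+.2f}")
```

Because the result depends only on rank order, the transformation is nonlinear. In this made-up sample the raw-score mean is 5.2, yet the raw score of 5 receives a positive normalized standard score.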

The conversion of percentiles to normalized 
standard scores might seem an ideal solution to the 
problem of unruly test data. However, there is a po- 
tentially serious drawback: Normalized standard 
scores are a nonlinear transformation of the raw 
scores. Thus, mathematical relationships estab- 
lished with the raw scores may not hold true for the 
normalized standard scores. In a markedly skewed
distribution, it is even possible for a raw score that
is significantly below the mean to receive a
normalized standard score that is above the
mean.

In practice, normalized standard scores are 
used sparingly. Such transformations are appropri- 
ate only when the normative sample is large and 
representative and the raw score distribution is only 
mildly nonnormal. Incidentally, the most likely 
cause of these nonnormal score distributions is in- 
appropriate difficulty level in the test items, such as 
too many difficult or easy items. 


There is a catch-22 here, in that mildly non- 
normal distributions are not changed much when 
they are normalized, so little is gained in the 
process. Ironically, normalized standard scores pro- 
duce the greatest change with markedly nonnormal 
distributions. However, when the raw score distri- 
bution is markedly nonnormal, test developers are 
better advised to go back to the drawing board and 
adjust the difficulty level of test items so as to pro- 
duce a normal distribution, rather than succumb to 
the partial statistical fix of normalized standard 
scores. 


Stanines, Stens, and C Scale 


Finally, we give brief mention to three raw score 
transformations that are mainly of historical inter- 
est. The stanine (standard nine) scale was devel- 
oped by the United States Air Force during World 
War II. In a stanine scale, all raw scores are con- 
verted to a single-digit system of scores ranging 
from 1 to 9. The mean of stanine scores is always 
5, and the standard deviation is approximately 2. 
The transformation from raw scores to stanines is 
simple: The scores are ranked from lowest to high- 
est, and the bottom 4 percent of scores convert to a 
stanine of 1, the next 7 percent convert to a stanine 
of 2, and so on (see Table 3.5). The main advantage 
of stanines is that they are restricted to single-digit 
numbers. This was a considerable asset in the 
premodern computer era in which data was key- 
punched on Hollerith cards that had to be physi- 
cally carried and stored on shelves. Because a 
stanine could be keypunched in a single column, far 
fewer cards were required than if the original raw 
scores were entered. 
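The ranking procedure can be sketched in Python. The first two band widths (4 and 7 percent) come from the text; the remaining values are the conventional stanine percentages, supplied here as an assumption. The function name and the treatment of scores falling exactly on a boundary are likewise our own choices.

```python
from bisect import bisect_right

# Percentage of cases assigned to stanines 1 through 9 (conventional values)
STANINE_PERCENTAGES = [4, 7, 12, 17, 20, 17, 12, 7, 4]

# Cumulative upper boundaries of stanines 1 through 8
CUTOFFS = []
running_total = 0
for pct in STANINE_PERCENTAGES[:-1]:
    running_total += pct
    CUTOFFS.append(running_total)


def stanine_from_percentile(percentile):
    """Map a percentile (0 to 100) to a stanine from 1 to 9."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie between 0 and 100")
    # Count how many boundaries the percentile has reached or passed
    return bisect_right(CUTOFFS, percentile) + 1


for p in (2, 10, 50, 84, 99):
    print(f"percentile {p:2d} -> stanine {stanine_from_percentile(p)}")
```

The band widths are symmetrical around the middle stanine, which is why the mean of a stanine distribution is 5.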

TABLE 3.5 Distribution Percentages for Use in Stanine Conversion


Statisticians have proposed several variations
on the stanine theme. Canfield (1951) proposed the
10-unit sten scale, with 5 units above and 5 units
below the mean. Guilford and Fruchter (1978) proposed
the C scale consisting of 11 units. Although
stanines are still in widespread use, variants such as
the sten and C scale never roused much interest
among test developers.


A Summary of Statistically Based Norms 


We have alluded several times to the ease with 
which standard scores, T scores, stanines, and per- 
centiles can be transformed into each other, espe- 
cially if the underlying distribution of raw scores is 
normally distributed. In fact, the exact form in 
which scores are reported is largely a matter of convention
and personal preference. For example, a
WAIS-III IQ of 115 could also be reported as a standard
score of +1.00, or a T score of 60, or a percentile
rank of 84. All of these results convey exactly
the same information.¹ Figure 3.8 summarizes
the relationships that exist between the most commonly
used statistically based norms.
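The equivalence of these reporting formats can be checked with a short Python sketch. The function name equivalent_scores is hypothetical, the default mean of 100 and standard deviation of 15 are the IQ values from Table 3.4, and the percentile figure assumes a normal distribution of scores.

```python
from statistics import NormalDist


def equivalent_scores(iq, mean=100, sd=15):
    """Express one IQ score as a z score, a T score, and a percentile rank."""
    z = (iq - mean) / sd
    return {
        "z": z,
        "T": 10 * z + 50,
        "percentile": 100 * NormalDist().cdf(z),  # assumes normality
    }


scores = equivalent_scores(115)
print(f"z = {scores['z']:+.2f}, T = {scores['T']:.0f}, "
      f"percentile rank = {scores['percentile']:.0f}")
```

Each value is obtained from the same z score, which is why the formats carry the same information.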

This ends the brief introduction to the many
techniques by which test data from a normative
sample can be statistically summarized and transformed.
We should never lose sight of the overriding
purpose of these statistical transmutations,
namely, to help the test user make sense out of one
individual’s score in relation to an appropriate comparison
group.


1. A WAIS-III IQ of 115 also can be expressed as a stanine
of 7. However, it is worth noting that some information is lost when
scores are reported as stanines. Note that IQs in the range of 111
to 119 all convert to a stanine of 7. Thus, if we are told only that
an individual has achieved at the 7th stanine on an intelligence
test, we do not know the exact IQ equivalent.

But what is an appropriate comparison group? 
What characteristics should we require in our norm 
group subjects? How should we go about choosing 
these subjects? How many subjects do we need? 
These are important questions that influence the 
relevance of test results just as much as proper item 
selection and standardized testing procedure. In the 
remainder of this topic, we examine the procedures 
involved in selecting a norm group. 


SELECTING A NORM GROUP


When choosing a norm group, test developers strive 
to obtain a representative cross section of the pop- 
ulation for whom the test is designed (Petersen, 
Kolen, & Hoover, 1989). In theory, obtaining a rep- 
resentative norm group is straightforward and sim- 
ple. Consider a scholastic achievement test 
designed for sixth graders in the United States. The 
relevant population is all sixth graders coast to 
coast and in Alaska and Hawaii. A representative 
cross section of these potential subjects could be 
obtained by computerized random sampling of 
10,000 or so of the millions of eligible children. 
Each child would have an equal chance of being 
chosen to take the test; that is, the selection strat- 
egy would be simple random sampling. The re- 
sults for such a sample would comprise an ideal 
source of normative data. With a large random sam- 
ple, it is almost certain that the diversities of ethnic 
background, social class, geographic location, 
urban versus rural setting, and so on would be pro- 
portionately represented in the sample. 

In the real world, obtaining norm samples is 
never as simple and definitive as the hypothetical 
case previously outlined. Researchers do not have 
a complete list of every sixth grader in the nation, 
and even if they did, test developers could not com- 
pel every randomly selected child to participate in 
the standardization of a test. Questions of cost arise, 
too. Psychometricians must be paid to administer 
the tests to the norm group. Test developers may 
opt for a few hundred representative subjects in- 
stead of a larger number. 


To help ensure that smaller norm groups are 
truly representative of the population for which the 
test was designed, test developers employ strati- 
fied random sampling. This approach consists of 
stratifying, or classifying, the target population on 
important background variables (e.g., age, sex, 
race, social class, educational level) and then se- 
lecting an appropriate percentage of persons at ran- 
dom from each stratum. For example, if 12 percent 
of the relevant population is African American, 
then the test developer chooses subjects randomly, 
but with the constraint that 12 percent of the norm 
group is also African American. 
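The logic of proportional stratified sampling can be sketched in Python. Everything here is illustrative: the function name, the record layout, and the population of 1,000 hypothetical persons (12 percent of whom belong to a stratum labeled "A") are assumptions, and real norming studies usually stratify on several variables at once.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=None):
    """Draw a proportionally allocated random sample from each stratum."""
    rng = random.Random(seed)

    # Group the population by the stratifying variable
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)

    sample = []
    for members in strata.values():
        # Each stratum's quota mirrors its share of the population.
        # Rounding can make the total differ slightly from sample_size.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population in which 12 percent belong to stratum "A"
population = [{"id": i, "group": "A" if i < 120 else "B"} for i in range(1000)]
norm_group = stratified_sample(population, lambda p: p["group"], 100, seed=1)
share_a = sum(p["group"] == "A" for p in norm_group) / len(norm_group)
print(f"Stratum A share of the norm group: {share_a:.0%}")
```

Within each stratum the selection remains random; only the proportions are fixed in advance.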

In practice, very few test developers fully em- 
ulate either random sampling or stratified random 
sampling in the process of selecting the norm 
group. What is more typical is a good faith effort to 
pick a diverse and representative sample from 
strong and weak schools, minority and white neigh- 
borhoods, large and small cities, and north, east, 
central, and southern communities. If this sample 
then embodies about the same percentage of mi- 
norities, city dwellers, upper- and lower-class fam- 
ilies as the national census, then the test developer 
feels secure that the norm group is representative. 

There is an important lesson in the uncertain- 
ties, compromises, and pragmatics of norm group 
selection; namely, psychological test norms are not 
absolute, universal, or timeless. They are relative to 
one historical era and the particular normative pop- 
ulation from which they were derived. We will il- 
lustrate the ephemeral nature of normative statistics 
in a later section when we show how a major IQ 
test normed at a national average of 100 in 1974 
yielded a national average of 107 in 1988. Even 
norms that are selected with great care and based 
on large samples can become obsolete in a 
decade—sometimes less. 


Age and Grade Norms 


As we grow older, we change in measurable ways, 
for better or worse. This is obviously true in child- 
hood, when intellectual skills improve visibly from 
one month to the next. In adulthood, personal 
change is slower but still discernible. We expect,
for example, that adults will show a more mature
level of vocabulary with each passing decade (Greg- 
ory & Gernert, 1990). 

An age norm depicts the level of test perfor- 
mance for each separate age group in the norma- 
tive sample. The purpose of age norms is to 
facilitate same-aged comparisons. With age norms, 
the performance of an examinee is interpreted in re- 
lation to standardization subjects of the same age. 
The age span for a normative age group can vary 
from a month to a decade or more, depending upon 
the degree to which test performance is age-dependent.
For characteristics that change quickly with
age, such as intellectual abilities in childhood,
test developers might report separate test norms for
narrowly defined age brackets, such as four-month
intervals. This allows the examiner, for example, to 
compare test results of a child who is 5 years and 2 
months old (age 5-2) to the normative sample of 
children ranging from age 5-0 to age 5-4. By con- 
trast, adult characteristics change more slowly and 
it might be sufficient to report normative data by 5- 
or 10-year age intervals. 

Grade norms are conceptually similar to age 
norms. A grade norm depicts the level of test per- 
formance for each separate grade in the normative 
sample. Grade norms are rarely used with ability 
tests. However, these norms are especially useful in 
school settings when reporting the achievement 
levels of schoolchildren. Since academic achieve- 
ment in many content areas is heavily dependent 
upon grade-based curricular exposure, comparing a 
student against a normative sample from the same 
grade is more appropriate than using an age-based 
comparison. 


Local and Subgroup Norms 


With many applications, local or subgroup norms 
are needed to suit the specific purpose of a test. 
Local norms are derived from representative local 
examinees, as opposed to a national sample. Like- 
wise, subgroup norms consist of the scores 
obtained from an identified subgroup (African 
Americans, Hispanics, females), as opposed to a diversified
national sample. As an example of local
norms in action, the admissions officer of a junior
college that attracts mainly local residents might 
prefer to consult statewide norms rather than na- 
tional norms on a scholastic achievement test. 

As a general rule, whenever an identifiable sub- 
group performs appreciably better or worse on a 
test than the more broadly defined standardization 
sample, it may be helpful to construct supplemen- 
tary subgroup norms. The subgroups can be formed 
with respect to sex, ethnic background, geographi- 
cal region, urban versus rural environment, socio- 
economic level, and many other factors. 

Whether local or subgroup norms are beneficial 
depends on the purpose of testing. For example, 
ethnic norms for standardized intelligence tests 
may be superior to nationally based norms in pre- 
dicting competence within the child’s nonschool 
environment. However, ethnic norms may not pre- 
dict how well a child will succeed in mainstream 
public school instructional programs (Mercer & 
Lewis, 1978). Thus, local and subgroup norms 
must be used cautiously. 


Expectancy Tables 


One practical form that norms may take is an ex- 
pectancy table. An expectancy table portrays the 
established relationship between test scores and ex- 
pected outcome on a relevant task (Harmon, 1989). 
Expectancy tables are especially useful with pre- 
dictor tests used to forecast well-defined criteria. 
For example, an expectancy table could depict the 
relationship between scores on a scholastic aptitude 
test (predictor) and subsequent college grade point 
average (criterion). 

Expectancy tables are always based on the pre- 
vious predictor and criterion results for large sam- 
ples of examinees. The practical value of tabulating 
normative information in this manner is that new 
examinees receive a probabilistic preview of how 
well they are likely to do on the criterion. For ex- 
ample, high school examinees who take a scholas- 
tic aptitude test can be told the statistical odds of 
achieving a particular college grade point average. 

Based on 7,835 previous examinees who subsequently
attended a major university, the expectancy
table in Table 3.6 provides the probability of achieving
certain first-year college grades as a function of
score on the American College Testing (ACT) ex- 
amination. The ACT test is typically given to high 
school seniors who have expressed an interest in at- 
tending college. The first column of the table shows 
ACT test scores, divided into 10 class intervals. The 
second column gives the number of students whose 
scores fell into each interval. The remaining entries 
in each row show the percentage of students within 
each test-score interval who subsequently received 
college grade points within a designated range. For 
example, of the 117 students who scored 31 to 33 
points on the ACT, only 2 percent received a first- 
year college grade point average below 1.50, while 
64 percent earned superlative grades of 3.50 up to a 
perfect A or 4.00. At the other extreme, of the 102 
students who scored below 10 points on the ACT, 
fully 80 percent (60 percent plus 20 percent) received 
first-year college grades below a C average of 2.00. 
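Reading an expectancy table amounts to a lookup followed by a sum across columns. The Python sketch below encodes only the two rows of Table 3.6 discussed in the example; the data structure and function name are our own assumptions.

```python
# Lower bound of each grade point average band in Table 3.6
GPA_BAND_STARTS = [0.00, 1.50, 2.00, 2.50, 3.00, 3.50]

# Percentage of past examinees in each GPA band, for two ACT score intervals
EXPECTANCY = {
    "31-33": [2, 2, 4, 9, 19, 64],
    "below 10": [60, 20, 13, 8, 0, 0],
}


def percent_below(act_interval, gpa_cutoff):
    """Percentage of past examinees in an ACT interval whose GPA fell below a band boundary."""
    if gpa_cutoff not in GPA_BAND_STARTS:
        raise ValueError("cutoff must be the lower bound of a tabled band")
    row = EXPECTANCY[act_interval]
    # Add up every band that lies entirely below the cutoff
    return sum(pct for start, pct in zip(GPA_BAND_STARTS, row) if start < gpa_cutoff)


print(percent_below("below 10", 2.00))  # the two lowest bands combined
print(percent_below("31-33", 1.50))     # the lowest band only
print(EXPECTANCY["31-33"][-1])          # percentage earning 3.50 to 4.00
```

The cutoff is restricted to band boundaries because the table gives no information about how grades are spread within a band.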

Of course, expectancy tables do not foreordain 
how new examinees will do on the criterion. In an 
individual case, it is conceivable that a low-ACT 
scoring student might beat the odds and earn a 4.00 
college grade point average. More commonly,
though, new examinees discover that expectancy
tables provide a broadly accurate preview of crite- 
rion performance. 

But there are some exceptional instances in 
which expectancy tables can become inaccurate. 
An expectancy table is always based on the previ- 
ous performance of a large and representative sam- 
ple of examinees whose test performances and 
criterion outcomes reflected existing social condi- 
tions and institutional policies. If conditions or 
policies change, an expectancy table can become 
obsolete and misleading. Consider the expectancy 
table in Figure 3.9, which depicts the likelihood of 
finishing high school as a function of seventh-grade 
IQ (Dillon, 1949, cited in Matarazzo, 1972, p. 283). 
Notice that in the 1940s only 4 percent of seventh- 
grade students with IQs below 85 went on to finish 
high school. However, social policies and school 
environments have changed since the 1940s. There 
is currently a strong emphasis on special services 
for students with disabilities, with the aim of re- 
tention and eventual graduation. As a result, the ex- 
pectancy table in Figure 3.9 surely would be 
pessimistically erroneous if applied to contempo- 
rary seventh-grade students with low IQs. 


TABLE 3.6 Expectancy Table Showing Relation between ACT Composite Scores
and First-Year College Grades for 7,835 Students at a Major State University

                        Grade Point Average (4.00 Scale)
ACT Test  Number of  0.00-  1.50-  2.00-  2.50-  3.00-  3.50-
Score         Cases   1.49   1.99   2.49   2.99   3.49   4.00
34-36             3      0      0     33      0      0     67
31-33           117      2      2      4      9     19     64
28-30           646     10      6     10     17     23     35
25-27         1,458     12     10     16     19     24     19
22-24         1,676     17     10     22     20     20     11
19-21         1,638     23     14     25     18     16      4
16-18         1,173     31     17     24     15     11      3
13-15           690     38     18     25     12      6      1
10-12           332     54     16     20      6      3      1
below 10        102     60     20     13      8      0      0

Note: Some rows total to more than 100 percent because of rounding errors.
Source: Courtesy of Archie George, Management Information Services, University of Idaho.


FIGURE 3.9 Expectancy of High School Graduation as
a Function of Seventh-Grade IQ 

Source: Based on data from Dillon, H. J. (1949). Early school 
leavers: A major educational problem. New York: National Child 
Labor Committee. Cited in Matarazzo (1972). 


CRITERION-REFERENCED TESTS


We close this unit with brief mention of an alter- 
native to norm-referenced tests, namely, criterion- 
referenced tests. These two kinds of tests differ in 
their intended purposes, the manner in which con- 
tent is chosen, and the process of interpreting re- 
sults (Berk, 1984; Bond, 1996; Frechtling, 1989; 
Popham, 1978). 

The purpose of a norm-referenced test is to 
classify examinees, from low to high, across a 
continuum of ability or achievement. Thus, a norm- 
referenced test uses a representative sample of 
individuals—the norm group or standardization 
sample—as its interpretive framework. Examiners 
might want to classify individuals in this way for 
purposes of selection to a specialized curriculum or 
placement in remedial or gifted programs. In a 
classroom setting, a teacher might use a norm- 
referenced test to assign students to different read- 
ing levels or math instructional groups (Bond, 
1996). 

Whereas norm-referenced tests are used to rank 
students along a continuum in comparison to one 
another, criterion-referenced tests are used to compare
examinees’ accomplishments to a predefined
performance standard. For example, consider a hy- 
pothetical school system in which fourth graders 
are expected to master the addition of pairs of two- 
digit numbers (e.g., 23 + 19 = 42). Perhaps the per- 
formance standard is set at 80 percent accuracy 
when doing ten such addition problems in a 15- 
minute time period. Results for a specific fourth 
grader are then descriptively stated as a particular 
percentage (e.g., 70 percent). While it is possible to 
compare this result to the predetermined standard, 
no comparison is made to other students. In fact, it 
is entirely possible (and even desirable) for all stu- 
dents to exceed the standard. 
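The scoring logic of such a test is simple enough to sketch in Python. The ten addition problems and the pupil's answers below are made up for illustration, and the names are our own; only the 80 percent standard comes from the hypothetical example.

```python
MASTERY_STANDARD = 80  # percent correct required, from the hypothetical example

# Ten made-up two-digit addition problems
problems = [(23, 19), (14, 27), (35, 48), (52, 39), (61, 18),
            (47, 26), (29, 33), (16, 75), (44, 44), (38, 57)]

# A hypothetical pupil's answers; the last three contain errors
answers = [42, 41, 83, 91, 79, 73, 62, 81, 98, 85]


def percent_correct(problems, answers):
    """Score a set of addition problems as a percentage correct."""
    correct = sum(1 for (a, b), given in zip(problems, answers) if a + b == given)
    return 100 * correct / len(problems)


score = percent_correct(problems, answers)
mastered = score >= MASTERY_STANDARD
print(f"{score:.0f} percent correct; standard met: {mastered}")
```

No other pupil's performance enters the computation, which is the defining feature of criterion-referenced interpretation.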

Criterion-referenced tests represent a funda- 
mental shift in perspective. The focus is on what 
the test taker can do rather than on comparisons to 
the performance levels of others. Thus, criterion- 
referenced tests identify an examinee’s relative mas- 
tery (or nonmastery) of specific, predetermined 
competencies. These kinds of tests are increasingly 
popular in educational systems, where they are 
used to evaluate how well students have mastered 
the academic skills expected at each grade level. 
This information, in turn, provides a basis for 
intervention with students who are lagging behind. 
In addition, system-wide results of criterion- 
referenced tests can be used to evaluate the cur- 
riculum and to determine how well individual 
schools are teaching the curriculum. 

A major difference between norm-referenced 
tests and criterion-referenced tests is the manner in 
which test content is chosen. In a norm-referenced 
test, items are chosen so that they provide maximal 
discrimination among respondents along the di- 
mension being measured. Within this framework, 
well-defined psychometric principles are used to 
identify ideal items according to difficulty level, cor- 
relation with the total score, and other properties. In 
contrast, with a criterion-referenced test, the content 
is selected on the basis of its relevance in the cur- 
riculum. This involves the judgment and consensus 
of educators and other stakeholders in the educa- 
tional enterprise. In Table 3.7, we have summarized 
and compared some distinctive characteristics of 
criterion-referenced and norm-referenced tests.


TABLE 3.7 Distinctive Characteristics of Criterion-Referenced
and Norm-Referenced Tests

Dimension         Criterion-Referenced Tests      Norm-Referenced Tests

Purpose           Compare examinees’              Compare examinees’
                  performance to a standard       performance to one another

Item Content      Narrow domain of skills         Broad domain of skills with
                  with real-world relevance       indirect relevance

Item Selection    Most items of similar           Items vary widely in difficulty
                  difficulty level                level

Interpretation    Scores usually expressed as     Scores usually expressed as a
of Scores         a percentage, with passing      standard score, percentile, or
                  level predetermined             grade equivalent





Criterion-referenced tests are best suited to 
the testing of basic academic skills (e.g., reading 
level, computation skill) in educational settings. 
However, these kinds of instruments are largely 
inappropriate for testing higher-level abilities be- 
cause it is difficult to formulate specific objectives 
for such content domains. Consider a particular 
case: How could we develop a criterion-referenced
test for expert computer programming? It would
be difficult to propose specific behaviors that all ex- 
pert computer programmers would possess, and 
therefore nearly impossible to construct a criterion- 
referenced test for this high-level skill. Berk (1984) 
discusses the technical problems in the construc- 
tion and evaluation of criterion-referenced tests. 


SUMMARY 


1. A norm group consists of a sample of ex- 
aminees who are representative of the population 
for whom the test is intended. A frequency distri- 
bution is useful in portraying the distribution of test 
scores within certain score intervals for a norm 
group. A histogram is a graphic representation of a 
frequency distribution. 


2. Measures of central tendency for collec- 
tions of scores include the mean or arithmetic av- 
erage; the median or middlemost of the ranked 
scores; and the mode, which is the most frequently 
occurring score. 


3. Measures of variability for a group of 
scores include the variance and its square root, the 
standard deviation, which is the preferred measure 
in psychological testing. These indices help gauge 
the dispersion of scores by incorporating the sums 
of squared deviations from the mean score in their 
formulas. 


4. The distribution of test scores for large 
groups of heterogeneous examinees often resem- 
bles the normal distribution, a symmetrical, mathe- 
matically defined, bell-shaped curve. Psychologists 
prefer to deal with normally distributed test scores 
because the statistical characteristics of the normal 
distribution are well known. 


5. A skewed distribution is one in which the 
scores pile up at the low end (positive skew) or the 
high end (negative skew). On psychological tests, 
the most common cause of positive skew is too few 
easy items, whereas the most common cause of 
negative skew is too few hard items. 

6. A percentile expresses the percentage of 
persons in the standardization sample who scored 
below a specific raw score. Percentiles vary from 0 
to 100. It is important to distinguish percentile (a 
relative measure) from percent correct (an absolute 
measure). 


7. A standard score expresses an examinee’s 
raw score in terms of its distance from the mean in 
standard deviation units. The formula for a standard 
score is z = (X - M)/SD. A T score is a standardized
score with mean of 50 and standard deviation of 10. 
The formula for a T score is 


T = 10(X - M)/SD + 50 


8. The most common approach to selecting a 
norm group is through stratified random sampling. 
In this procedure, the target population is stratified 
or classified on important background variables 
(e.g., age, sex, race, social class, educational level) 
and then an appropriate percentage of persons is 
chosen at random within each stratum. 


9. For many tests, it is important to provide 
separate age and grade norms. Age norms are nec- 
essary for characteristics that change quickly with 
developing age—for example, intellectual abilities 
in childhood. Grade norms are commonly used in 
school settings when reporting the achievement
levels of schoolchildren. 


10. Local and subgroup norms may be valu- 
able if an identifiable subgroup performs apprecia- 
bly better or worse on a test than the more broadly 
defined standardization sample. 


11. An expectancy table—one form of test 
standardization—portrays the established relation- 
ship between test scores and expected outcome on 
a relevant task. For example, an expectancy table 
might depict the relationship between scores on a 
scholastic aptitude test and subsequent college 
grade point average. 

12. A criterion-referenced test compares an examinee’s
test accomplishments to a well-defined
content domain. These tests help identify an exam- 
inee’s mastery or nonmastery of specific behaviors. 
For example, results of a criterion-referenced test 
might specify that the examinee can add two 3-digit 
numbers correctly 100 percent of the time. 
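
The statistics named in points 2, 3, and 7 of this summary can be sketched in a few lines of Python. The raw scores below are hypothetical, and the sketch assumes the standard deviation is computed by dividing by N (the population form); a text that divides by N - 1 would give slightly different values.

```python
import statistics

def z_score(x, mean, sd):
    """Standard score: distance from the mean in standard deviation units."""
    return (x - mean) / sd

def t_score(x, mean, sd):
    """T score: standardized score with a mean of 50 and an SD of 10."""
    return 10 * z_score(x, mean, sd) + 50

# Hypothetical raw scores for a small norm group (illustrative only).
raw_scores = [12, 15, 15, 18, 20, 22, 24]

mean = statistics.mean(raw_scores)      # arithmetic average
median = statistics.median(raw_scores)  # middlemost of the ranked scores
mode = statistics.mode(raw_scores)      # most frequently occurring score
sd = statistics.pstdev(raw_scores)      # assumption: population SD (divide by N)

print("mean:", mean, "median:", median, "mode:", mode, "SD:", round(sd, 2))
print("z for a raw score of 24:", round(z_score(24, mean, sd), 2))
print("T for a raw score of 24:", round(t_score(24, mean, sd), 1))
```

A raw score equal to the mean always converts to z = 0 and T = 50, which is a quick check on any implementation.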


KEY TERMS AND CONCEPTS 


norm group p. 57
raw score p. 58
frequency distribution p. 59
histogram p. 60
frequency polygon p. 60
mean p. 60
median p. 60
mode p. 60
standard deviation p. 60
variance p. 61
normal distribution p. 61
skewness p. 63
percentile p. 63
standard score p. 64
T score p. 66
normalized standard score p. 68
stanine scale p. 68
sten scale p. 68
C scale p. 69
random sampling p. 70
stratified random sampling p. 70
age norm p. 71
grade norm p. 71
local norms p. 71
subgroup norms p. 71
expectancy table p. 71


Topic 3B Concepts of Reliability 


Case Exhibit 3.2 Test Reliability and Courtroom Testimony 
Classical Test Theory and the Sources of Measurement Error 


Sources of Measurement Error 
Measurement Error and Reliability 
The Reliability Coefficient 
The Correlation Coefficient 
The Correlation Coefficient as a Reliability Coefficient 
Reliability as Temporal Stability 

Reliability as Internal Consistency 

Item Response Theory and the New Rules of Measurement 
Special Circumstances in the Estimation of Reliability 

The Interpretation of Reliability Coefficients 

Reliability and the Standard Error of Measurement 


Summary 


Key Terms and Concepts 


Reliability refers to the attribute of consistency
in measurement. However, reliability is sel- 
dom an all-or-none matter; more commonly it is a 
question of degree. Very few measures of physical 
or psychological characteristics are completely con- 
sistent, even from one moment to the next. For ex- 
ample, a person who steps on a scale twice in quick 
succession might register a weight of 145½ pounds
the first time and 145¼ pounds the second. The same
individual might take two presumably equivalent 
forms of an IQ test and score 114 on one and 119 on 
the other. Two successive measures of speed of re- 
sponse—pressing a key quickly whenever the letter 
X appears on a microcomputer screen—might pro- 
duce a reaction time of 223 milliseconds on the first 
trial and 341 milliseconds on the next. We see in 
these examples a pattern of consistency—the pairs 
of measurements are not completely random—but 
different amounts of inconsistency are evident, too. 
In the short run, measures of weight are highly consistent,
intellectual test scores are moderately stable,
but simple reaction time is somewhat erratic. 

The concept of reliability is best viewed as a 
continuum ranging from minimal consistency of 
measurement (e.g., simple reaction time) to near- 
perfect repeatability of results (e.g., weight). Most 
psychological tests fall somewhere in between 
these two extremes. With regard to tests, an ac- 
ceptable degree of reliability is more than an aca- 
demic matter. After all, it would be foolish and 
unethical to base important decisions upon test re- 
sults that are not repeatable (Case Exhibit 3.2). 

Psychometricians have devised several statisti- 
cal methods for estimating the degree of reliability 
of measurements, and we will explore the compu- 
tation of such reliability coefficients in some detail. 
But first we examine a more fundamental issue to 
help clarify the meaning of reliability: What are the 
sources of consistency and inconsistency in psychological
test results?


CLASSICAL TEST THEORY AND THE
SOURCES OF MEASUREMENT ERROR 


The theory of measurement introduced here has 
been called the classical test theory because it was 
developed from simple assumptions made by test 
theorists since the inception of testing. This ap- 
proach is also called the theory of true and error 
scores, for reasons explained below. Charles Spearman
(1904) laid down the foundation for the theory,
which was subsequently extended and revised by
contemporary psychologists (Feldt & Brennan, 
1989; Gulliksen, 1950; Lord & Novick, 1968; 
Kline, 1986). We should mention that a rival model 
does exist and is slowly supplanting classical test 
theory as a basis for test development. Item re- 
sponse theory, or latent trait theory (Embretson & 
Hershberger, 1999), is an appealing alternative to 
classical test theory. We close this chapter with a 
brief review of item response theory. However, 
classical test theory was the basis for test develop- 
ment throughout most of the twentieth century. Ac- 
cordingly, we begin our coverage with this model. 

The basic starting point of the classical theory 
of measurement is the idea that test scores result 
from the influence of two factors: 


1. Factors that contribute to consistency. These con- 
sist entirely of the stable attributes of the indi- 
vidual, which the examiner is trying to measure. 

2. Factors that contribute to inconsistency. These 
include characteristics of the individual, test, or 
situation that have nothing to do with the at- 
tribute being measured, but that nonetheless af- 
fect test scores. 


It should be clear to the reader that the first fac- 
tor is desirable because it represents the true 
amount of the attribute in question, while the sec- 
ond factor represents the unavoidable nuisance of 
error factors that contribute to inaccuracies of mea- 
surement. We can express this conceptual break- 
down as a simple equation: 


X = T + e


where X is the obtained score, T is the true score, 
and e represents errors of measurement. 


Errors in measurement thus represent discrep- 
ancies between the obtained scores and the corre- 
sponding true scores: 


e = X - T


Notice in the preceding equations that errors of 
measurement e can be either positive or negative. If 
e is positive, the obtained score X will be higher 
than the true score T. Conversely, if e is negative, 
the obtained score will be lower than the true score. 
Although it is impossible to eliminate all mea- 
surement error, test developers do strive to mini- 
mize this psychometric nuisance through careful 
attention to the sources of measurement error out- 
lined in the following section. 

Finally, it is important to stress that the true 
score is never known. As the reader will discover, 
we can obtain a probability that the true score re- 
sides within a certain interval and we can also de- 
rive a best estimate of the true score. However, we 
can never know the value of a true score with 
certainty. 


SOURCES OF MEASUREMENT ERROR


As indicated by the formula X = T + e, measure- 
ment error e is everything other than the true score 
that makes up the obtained test score. Errors of 
measurement can arise from innumerable sources 
(Feldt & Brennan, 1989). Stanley (1971) provides 
an unusually thorough list. We will outline only the 
most important and likely contributions here: item 
selection, test administration, test scoring, and sys- 
tematic errors of measurement. 


Item Selection 


One source of measurement error is the instrument 
itself. A test developer must settle upon a finite 
number of items from a potentially infinite pool of 
test questions. Which questions should be in- 
cluded? How should they be worded? Item selec- 
tion is crucial to the accuracy of measurement. 
Although psychometricians strive to obtain rep- 
resentative test items, the particular set of questions 
chosen for a test might not be equally fair to all 


persons. A hypothetical and deliberately extreme 
example will serve to illustrate this point: Even a 
well-prepared student might flunk a classroom test 
that emphasized the obscure footnotes in the text- 
book. By contrast, an ill-prepared but curious stu- 
dent who studied only the footnotes might do very 
well on such an exam. The scores for both persons 
would reflect massive amounts of measurement 
error. Remember in this context that the true score 
is what the student really knows. For the conscien- 
tious student, the obtained score would be far lower 
than the true score, because of a hefty dose of neg- 
ative measurement error. For the serendipitous sec- 
ond student, the obtained score would be far higher 
than the true score, owing to the positive measure- 
ment error. 

Of course, in a well-designed test the measure- 
ment error from item sampling will be minimal. 
However, a test is always a sample and never the to- 
tality of a person’s knowledge or behavior. As a 
result, item selection is always a source of mea- 
surement error in psychological testing. The best a 
psychometrician can do is minimize this unwanted 
nuisance by attending carefully to issues of test 
construction. We discuss technical aspects of item 
selection in Topic 4B, Test Construction. 


Test Administration 


Although examiners usually provide an optimal 
and standardized testing environment, numerous 
sources of measurement error may nonetheless 
arise from the circumstances of administration. Ex- 
amples of general environmental conditions that 
may exert an untoward influence on the accuracy of 
measurement include uncomfortable room temper- 
ature, dim lighting, and excessive noise. In some 
cases it is not possible to anticipate the qualities of 
the testing situation that will contribute to mea- 
surement error. Consider this example: An other- 
wise lackluster undergraduate correctly answers a 
not very challenging information item, namely, 
“Who wrote Canterbury Tales?” When queried
later whether he had read any Chaucer, the student 
replies, “No, but you’ve got that book right behind 
you on your bookshelf.”

Momentary fluctuations in anxiety, motivation,
attention, and fatigue level of the test taker may also 
introduce sources of measurement error. For exam- 
ple, an examinee who did not sleep well the night 
before might lack concentration and therefore mis- 
read questions. A student distracted by temporary 
emotional distress might inadvertently respond in 
the wrong columns of the answer sheet. The clas- 
sic nightmare in this regard is the test taker who 
skips a question—let us say, question number 19— 
but forgets to leave the corresponding part of the 
answer sheet blank. As a result, all the subsequent 
answers are off by one, with the response to ques- 
tion 20 entered on the answer sheet as item 19, and 
so on. 
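
The cost of the misaligned answer sheet is easy to see in a small sketch. The 30-item answer key and the examinee below are hypothetical; the point is only that an examinee who knows every answer can still obtain a score far below the true score.

```python
def score(key, responses):
    """Count responses that match the answer key, position by position."""
    return sum(1 for k, r in zip(key, responses) if k == r)

# Hypothetical 30-item key whose answers cycle A, B, C, D.
key = ["ABCD"[i % 4] for i in range(30)]

# The examinee knows every answer but skips question 19.
intended = list(key)
intended[18] = None  # question 19 left unanswered (index 18)

# The examinee forgets to leave row 19 blank, so the answers to
# questions 20 through 30 each land one row too early.
misaligned = intended[:18] + intended[19:] + [None]

print("score if row 19 is left blank:", score(key, intended))
print("score with answers off by one:", score(key, misaligned))
```

With this particular key no shifted answer happens to match its new row; with a real key a few would match by chance, so the loss would be somewhat smaller.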

The examiner, too, may contribute to measure- 
ment error in the process of test administration. In 
an orally administered test, an unconscious nod of 
the head by the tester might convey that the exam- 
inee is on the right track, thereby guiding the test 
taker to the correct response. Conversely, a terse 
and abrupt examiner may intimidate a test taker 
who would otherwise volunteer a correct answer. 


Test Scoring 


Whenever a psychological test uses a format 
other than machine-scored multiple-choice items, 
some degree of judgment is required to assign 
points to answers. Fortunately, most tests have 
well-defined criteria for answers to each question. 
These guidelines help minimize the impact of 
subjective judgment in scoring (Gregory, 1987). 
However, subjectivity of scoring as a source of 
measurement error can be a serious problem in the 
evaluation of projective tests or essay questions. 
With regard to projective tests, Nunnally (1978) 
points out that the projective tester might undergo 
an evolutionary change in scoring criteria over time, 
coming to regard a particular type of response as 
more and more pathological with each encounter. 


Systematic Measurement Error 


The sources of inaccuracy previously discussed are 
collectively referred to as unsystematic measurement
error, meaning that their effects are unpredictable 
and inconsistent. However, there is another type of 
measurement error that constitutes a veritable ghost 
in the psychometric machine. A systematic mea- 
surement error arises when, unknown to the test 
developer, a test consistently measures something 
other than the trait for which it was intended. Sup- 
pose, for example, that a scale to measure social in- 
troversion also inadvertently taps anxiety in a 
consistent fashion. In this case, the equation de- 
picting the relationship between observed scores, 
true scores, and sources of measurement error 
would be 


$X = T + e_s + e_u$


where X is the obtained score, T is the true score,
$e_s$ is the systematic error due to the anxiety subcomponent,
and $e_u$ is the collective effect of the
unsystematic measurement errors previously
outlined.

Because by definition their presence is initially 
undetected, systematic measurement errors may 
constitute a significant problem in the development 
of psychological tests. However, if psychometri- 
cians use proper test development procedures dis- 
cussed in Topic 4B, Test Construction, the impact 
of systematic measurement errors can be greatly 
minimized. Nonetheless, systematic measurement 
errors serve as a reminder that it is very difficult, if 
not impossible, to truly assess a trait in pure isola- 
tion from other traits. 

We have furnished the barest outline of the numerous
and varied sources of measurement error in
this section. The reader may wish to review Topic 
2B, The Testing Process, which provides more de- 
tail on the multitudinous factors that can sway the 
outcome of psychological testing and thereby in- 
troduce measurement error. 


MEASUREMENT ERROR AND RELIABILITY


Perhaps at this point the reader is wondering what
measurement error has to do with reliability. The
most obvious connection is that measurement error
reduces the reliability or repeatability of psychological
test results. In fact, we will show here that
reliability bears a precise statistical relationship to 
measurement error. Reliability and measurement 
error are really just different ways of expressing the 
same concern: How consistent is a psychological 
test? The interdependence of these two concepts 
will become clear if we provide a further sketch of 
the classical theory of measurement. 

A crucial assumption of classical theory is that 
unsystematic measurement errors act as random in- 
fluences. This does not mean that the sources of 
measurement error are completely mysterious and 
unfathomable in every individual case. We might 
suspect for one person that her score on digit span 
reflected a slight negative measurement error 
caused by the auditory interference of someone 
coughing in the hallway during the presentation of 
the fifth item. Likewise, we could conjecture that 
another person received the benefit of positive mea- 
surement error by glimpsing in the mirror behind 
the examiner to see the correct answer to the ninth 
item on an information test. Thus, measurement 
error is not necessarily a mysterious event in every 
individual case. 

However, when we examine the test scores of 
groups of persons, the causes of measurement error 
are incredibly complex and varied. In this context, 
unsystematic measurement errors behave like ran- 
dom variables. The classical theory accepts this es- 
sential randomness of measurement error as an 
axiomatic assumption.

Because they are random events, unsystematic 
measurement errors are equally likely to be posi- 
tive or negative and will therefore average out to 
zero across a large group of subjects. Thus, a sec- 
ond assumption is that the mean error of measure- 
ment is zero. Classical theory also assumes that 
measurement errors are not correlated with true 
scores. This makes intuitive sense: If the error 
scores were related to another score, it would sug- 
gest that they were systematic rather than random, 
which would violate the essential assumption of 
classical theory. Finally, it is also assumed that 
measurement errors are not correlated with errors 
on other tests. 


We can summarize the main features of classi- 
cal theory as follows (Gulliksen, 1950, chap. 2): 


1. Measurement errors are random. 
2. Mean error of measurement = 0. 
3. True scores and errors are uncorrelated: $r_{Te} = 0$.
4. Errors on different tests are uncorrelated: $r_{12} = 0$.


Starting from these assumptions, it is possible to 
develop a number of important implications for re- 
liability and measurement. (The points that follow 
are based on the optimistic assumption that sys- 
tematic measurement errors are minimal or nonex- 
istent for the instrument in question.) For example, 
we know that any test administered to a large group 
of persons will show a variability of obtained 
scores that can be expressed statistically as a variance,
that is, $\sigma^2$. The value of classical theory is
that it permits us to partition the variance of ob- 
tained scores into two separate sources. Specifi- 
cally, it can be shown that the variance of obtained 
scores is simply the variance of true scores plus the 
variance of errors of measurement: 
$\sigma_X^2 = \sigma_T^2 + \sigma_e^2$

We will refer the interested reader to Gulliksen 
(1950, chap. 3) for the computational details. 
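
For readers who want the gist without consulting the source, the partition follows in a few lines from the assumption that true scores and errors are uncorrelated. This is a sketch of the standard argument rather than a reproduction of Gulliksen's derivation; Var and Cov (variance and covariance) are notation introduced here for the illustration.

```latex
\begin{align*}
\sigma_X^2 &= \operatorname{Var}(T + e) \\
           &= \sigma_T^2 + \sigma_e^2 + 2\,\operatorname{Cov}(T, e) \\
           &= \sigma_T^2 + \sigma_e^2
              \qquad \text{because } r_{Te} = 0 \text{ implies } \operatorname{Cov}(T, e) = 0 .
\end{align*}
```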

The preceding formula demonstrates that test 
scores vary as the result of two factors: variability 
in true scores, and variability due to measurement 
error. The obvious implication of this relationship 
is that errors of measurement contribute to incon- 
sistency of obtained test scores; results will not re- 
main stable if the test is administered again. 
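
A small simulation can make the partition of variance concrete. The sketch below assumes a hypothetical population in which true scores have a standard deviation of 15 and random errors have a standard deviation of 5; both figures, and the normal distributions used to generate them, are illustrative assumptions only.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

N = 20_000
# Hypothetical population: true scores (SD 15) and random errors (SD 5).
true_scores = [rng.gauss(100, 15) for _ in range(N)]
errors = [rng.gauss(0, 5) for _ in range(N)]              # mean error is 0
obtained = [t + e for t, e in zip(true_scores, errors)]   # X = T + e

var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(obtained)

print(f"var(T) + var(e) = {var_t + var_e:.1f}")
print(f"var(X)          = {var_x:.1f}")
```

The two printed values agree closely but not exactly, because in any finite sample the true scores and errors show a small chance correlation.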


THE RELIABILITY COEFFICIENT


We are finally in a position to delineate the precise 
relationship between reliability and measurement 
error. By now the reader should have discerned that 
reliability expresses the relative influence of true 
and error scores on obtained test scores. In more 
precise mathematical terms, the reliability coefficient
($r_{XX}$) is the ratio of true score variance to the
total variance of test scores. That is:

$r_{XX} = \frac{\sigma_T^2}{\sigma_X^2}$

or equivalently:

$r_{XX} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2}$

Note that the range of potential values for $r_{XX}$
can be derived from analysis of the preceding formula.
Consider what happens when the variance
due to measurement error ($\sigma_e^2$) is very small, close
to zero. In that event, the reliability coefficient ($r_{XX}$)
approaches a value of ($\sigma_T^2/\sigma_T^2$) or 1.0. At the opposite
extreme, where the variance due to measurement
error is very large, the value of the
reliability coefficient becomes smaller, approaching
a theoretical limit of 0.0. In sum, a completely
unreliable test (large measurement error) will yield
a reliability coefficient close to 0.0, while a completely
reliable test (no measurement error) will
produce a reliability coefficient of 1.0. Thus, the
possible range of the reliability coefficient is between
0.0 and 1.0. In practice, all tests produce reliability
coefficients somewhere in between, but the
closer the value of $r_{XX}$ to 1.0, the better.

In a literal sense, $r_{XX}$ indicates the proportion of
variance in obtained test scores that is accounted
for by the variability in true scores. However, the
formula for the reliability coefficient $r_{XX}$ indicates
an additional interpretation of it as well. The reader
will recall that obtained scores are symbolized by
Xs. In like manner, the subscripts in the symbol
for the reliability coefficient signify that $r_{XX}$ is an
index of the potential or actual consistency of ob- 
tained scores. Thus, tests that capture minimal 
amounts of measurement error produce consistent 
and reliable scores; their reliability coefficients 
are near 1.0. Conversely, tests that reflect large 
amounts of measurement error produce inconsis- 
tent and unreliable scores; their reliability coeffi- 
cients are closer to 0.0. 
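
The behavior of the reliability coefficient at its two extremes can be checked with a short calculation. The function name and the variance components below are hypothetical, chosen only to span the range from no measurement error to overwhelming measurement error.

```python
def reliability(var_true, var_error):
    """Ratio of true score variance to total (true plus error) variance."""
    return var_true / (var_true + var_error)

# Hypothetical variance components, chosen only to show the range.
cases = [
    ("no measurement error", 225.0, 0.0),
    ("modest error", 225.0, 25.0),
    ("error equal to true variance", 225.0, 225.0),
    ("error dominates", 1.0, 1_000_000.0),
]

for label, var_true, var_error in cases:
    print(f"{label}: {reliability(var_true, var_error):.4f}")
```

As the error variance grows, the coefficient slides from 1.0 toward 0.0 without ever leaving that interval.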

Up to this point, the discussion of reliability has 
been conceptual rather than practical. We have 
pointed out that reliability refers to consistency of 
measurement; that reliability is diminished to the 
extent that errors of measurement dominate the 
obtained score; and that one statistical index of 
reliability, the reliability coefficient, can vary be- 
tween 0.0 and 1.0. But how is a statistical measure 
of reliability computed? We approach this topic indirectly,
first reviewing an essential statistical tool,
the correlation coefficient. The reader will discover
that the correlation coefficient, a numerical index of
the degree of linear relationship between two sets 
of scores, is an excellent tool for appraising the 
consistency or repeatability of test scores. We pro- 
vide a short refresher on the meaning of correlation 
before proceeding to a summary of methods for 
estimating reliability. 


THE CORRELATION COEFFICIENT


In its most common application, a correlation co- 
efficient (r) expresses the degree of linear relation- 
ship between two sets of scores obtained from the 
same persons. Correlation coefficients can take on 
values ranging from -1.00 to +1.00. A correlation
coefficient of +1.00 signifies a perfect linear rela- 
tionship between the two sets of scores. In particu- 
lar, when two measures have a correlation of +1.00, 
the rank ordering of subjects is identical for both 
sets of scores. Furthermore, when arrayed on a 
scatterplot (Figure 3.10a), the individual data 
points (each representing a pair of scores from a 
single subject) conform to a perfectly straight line 
with an upward slope. A correlation coefficient of 
-1.00 signifies an equally strong relationship, but
with inverse correspondence: the highest score on
one variable corresponding to the lowest score on 
the other, and vice versa. In this case, the individ- 
ual data points conform to a perfectly straight line 
with a downward slope (Figure 3.10b). Correlations
of +1.00 or -1.00 are extremely rare in psychological
research and usually signify a trivial
finding. For example, if on two occasions in quick 
succession we counted the number of letters in the 
last name of 100 students, these two sets of 
“scores” would show a correlation of +1.00. 
Negative correlations usually result from the 
manner in which one of the two variables was 
scored. For example, scores on the Category Test 
(Reitan & Wolfson, 1993) are reported as errors, 
whereas results on the Raven Progressive Matrices 
(Raven, Court, & Raven, 1983, 1986) are reported 
as number of items correct. Persons who obtain a 


high score on the Category Test (many errors) will 
most likely obtain a low score on the Progressive 
Matrices test (few correct). Thus, we would expect 
a substantial negative correlation for scores on 
these two tests. 

Consider the scatterplot in Figure 3.10c, which 
might depict the hypothetical heights and weights 
of a group of persons. As the reader can see, height 
and weight are strongly but not perfectly related to 
one another. Tall persons tend to weigh more, short 
persons less, but there are some exceptions. If we 
were to compute the correlation coefficient be- 
tween height and weight—a simple statistical task 
outlined in the following—we would obtain a value 
of about +.80, indicating a strong, positive rela- 
tionship between these measures. 

When two variables have no relationship, the 
scatterplot takes on an undefined bloblike shape 
and the correlation coefficient is close to 0.00 (Fig- 
ure 3.10d). For example, in a sample of adults, the 
correlation between reaction time and weight 
would most likely be very close to zero. 
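
For readers who prefer to see the computation, the following sketch calculates a Pearson correlation from sums of squared deviations and cross-products. The heights and weights are hypothetical, and the function name pearson_r is simply a label chosen for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds) for six persons.
heights = [62, 64, 66, 68, 70, 72]
weights = [120, 150, 135, 170, 160, 185]

print("height and weight:", round(pearson_r(heights, weights), 2))
print("identical scores:", round(pearson_r(heights, heights), 2))
print("reversed scoring:", round(pearson_r(heights, [-h for h in heights]), 2))
```

The last line shows how merely reversing the direction of scoring on one variable turns a perfect positive correlation into a perfect negative one.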





FIGURE 3.10 Scatterplots Depicting Different
Degrees of Correlation: (a) r = +1.00; (b) r = -1.00;
(c) r of about +.80; (d) r close to 0.00


Finally, it is important to understand that the 
correlation coefficient is independent of the mean. 
For example, a correlation of +1.00 can be found 
between two administrations of the same test even 
when there are significant mean differences be- 
tween pretest and posttest. In sum, perfect correla- 
tion does not imply identical pre- and posttest 
scores for each examinee. However, perfect corre- 
lation does imply perfectly ordered ranking from 
pretest to posttest, as discussed previously. 
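
The independence of the correlation coefficient from the mean is easily demonstrated. In the hypothetical data below, every examinee gains exactly 5 points from pretest to posttest, so the means differ while the rank ordering is untouched.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from deviation sums (helper for this sketch)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pretest scores; every examinee gains 5 points at posttest.
pretest = [10, 12, 15, 18, 20]
posttest = [score + 5 for score in pretest]

mean_gain = sum(posttest) / len(posttest) - sum(pretest) / len(pretest)
print("mean gain:", mean_gain)
print("correlation:", round(pearson_r(pretest, posttest), 2))
```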


THE CORRELATION COEFFICIENT 
AS A RELIABILITY COEFFICIENT


One use of the correlation coefficient is to gauge 
the consistency of psychological test scores. If test 
results are highly consistent, then the scores of per- 
sons taking the test on two occasions will be 
strongly correlated, perhaps even approaching the 
theoretical upper limit of +1.00. In this context, the 
correlation coefficient is also a reliability coeffi- 
cient. Even though the computation of the Pearson 
r makes no reference to the theory of true and error 
scores, the correlation coefficient does, nonethe- 
less, reflect the proportion of variance in obtained 
test scores accounted for by the variability in true 
scores. Thus, in some contexts a correlation coeffi- 
cient is a reliability coefficient. 
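
The claim that a test-retest correlation reflects the proportion of true score variance can be checked by simulation. The sketch below assumes hypothetical standard deviations of 15 for true scores and 7.5 for errors, and it assumes that each person's true score is unchanged between the two administrations while the errors are drawn independently.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation from deviation sums (helper for this sketch)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)  # fixed seed so the run is repeatable
N = 20_000
SD_TRUE, SD_ERROR = 15.0, 7.5  # hypothetical values

true_scores = [rng.gauss(100, SD_TRUE) for _ in range(N)]
# Two administrations: same true score, independent random errors.
first = [t + rng.gauss(0, SD_ERROR) for t in true_scores]
second = [t + rng.gauss(0, SD_ERROR) for t in true_scores]

variance_ratio = SD_TRUE**2 / (SD_TRUE**2 + SD_ERROR**2)
retest_r = pearson_r(first, second)
print(f"true/total variance = {variance_ratio:.3f}")
print(f"test-retest r       = {retest_r:.3f}")
```

The two figures agree to within sampling fluctuation, even though the correlation was computed with no reference to true scores at all.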

This discussion introduces one method for es- 
timating the reliability of a test: Administer the in- 
strument twice to the same group of persons and 
compute the correlation between the two sets of
scores. The test-retest approach is very common in 
the evaluation of reliability, but several other strate- 
gies exist as well. As we review the following meth- 
ods for estimating reliability, the reader may be 
temporarily bewildered by the apparent diversity of 
approaches. In fact, the different methods fall into
two broad groups; namely, temporal stability ap- 
proaches, which directly measure the consistency 
of test scores, and internal consistency approaches, 
which rely upon a single test administration to 
gauge reliability. Keep in mind that one common 
theme binds all the eclectic methods together: Reliability
is always an attempt to gauge the likely accuracy
or repeatability of test scores.


RELIABILITY AS
TEMPORAL STABILITY


Test-Retest Reliability 


The most straightforward method for determining 
the reliability of test scores is to administer the 
identical test twice to the same group of heteroge- 
neous and representative subjects. If the test is per- 
fectly reliable, each person’s second score will be 
completely predictable from his or her first score. 
On many kinds of tests, particularly ability and 
achievement tests, we might expect subjects gener- 
ally to score somewhat higher the second time be- 
cause of practice, maturation, schooling, or other 
intervening effects that take place between pretest 
and posttest. However, so long as the second score 
is strongly correlated with the first score, the exis- 
tence of practice, maturation, or treatment effects 
does not cast doubt upon the test-retest reliability 
of a psychological test. 

An example of a reliability coefficient computed
as a test-retest correlation coefficient is depicted
in Figure 3.11. In this case, 60 subjects were
administered the Finger Tapping Test (FTT) on two
occasions separated by a week (Morrison, Gregory,
& Paul, 1979). The FTT, one component of the Halstead-Reitan
neuropsychological test battery (Reitan
& Wolfson, 1985), is a relatively pure measure
of motor speed. Using a standardized mechanical
counting apparatus, the subject is instructed to tap
with the index finger as fast as possible for 10 seconds.
This procedure is continued until five trials in
a row reveal consistent results. The procedure is repeated
for the nondominant hand. The score for each
hand is the average of the five consecutive trials.


FIGURE 3.11 Scatterplot Revealing a Reliability
Coefficient of .80 (axes: Finger Tapping Speed, First Trial,
and Finger Tapping Speed, Second Trial)

Source: Based on data from Morrison, M. W., Gregory, R. J., &
Paul, J. J. (1979). Reliability of the Finger Tapping Test and a note
on sex differences. Perceptual and Motor Skills, 48, 139-142.

The correlation between scores from repeated 
administrations of this test works out to be about 
.80. This is at the low end of acceptability for reli- 
ability coefficients, which usually fall in the .80s or 
.90s. We discuss standards of reliability in more de- 
tail subsequently. 


Alternate-Forms Reliability 


In some cases test developers produce two forms of 
the same test. These alternate forms are indepen- 
dently constructed to meet the same specifications, 
often on an item-by-item basis. Thus, alternate 
forms of a test incorporate similar content and 
cover the same range and level of difficulty in 
items. Alternate forms of a test possess similar sta- 
tistical and normative properties. For example, 
when administered in counterbalanced fashion to 
the same group of subjects, the means and standard 
deviations of alternate forms are typically quite 
comparable. 

Estimates of alternate-forms reliability are 
derived by administering both forms to the same 
group and correlating the two sets of scores. This 
approach has much in common with test-retest 
methods—both strategies involve two test admin- 
istrations to the same subjects with an intervening 
time interval. For both approaches, we would ex- 
pect that intervening changes in motivation and in- 
dividual differences in amount of improvement 
would produce fluctuations in test scores and 
thereby reduce reliability estimates somewhat. 
Thus, test-retest and alternate-forms reliability es- 
timates share considerable conceptual similarity. 


However, there is one fundamental difference 
between these two approaches. The alternate-forms 
methodology introduces item-sampling differences 
as an additional source of error variance. That is, 
some test takers may do better or worse on one 
form of a test because of the particular items sam- 
pled. Even though the two forms may be equally 
difficult on average, some subjects may find one 
form quite a bit harder (or easier) than the other be- 
cause supposedly parallel items are not equally fa- 
miliar to every person. Notice that item-sampling 
differences are not a source of error variance in the 
test-retest approach because identical items are 
used in both administrations. 

Alternate forms of a test are also quite expen- 
sive—nearly doubling the cost of publishing a test 
and putting it on the market. Because of the in- 
creased cost and also the psychometric difficulties 
of producing truly parallel forms, fewer and fewer 
tests are being released in this format. 


RELIABILITY AS 
INTERNAL CONSISTENCY 


We turn now to some intriguing ways of estimating 
the reliability of an individual test without devel- 
oping alternate forms and without administering 
the test twice to the same examinees (Feldt & Bren- 
nan, 1989). The first approach correlates the results 
from one-half of the test with the other half and is 
appropriately termed split-half reliability. The sec- 
ond approach examines the internal consistency of 
individual test items. In this method, the psycho- 
metrician seeks to determine whether the test items 
tend to show a consistent interrelatedness. Finally, 
insofar as some tests are less than perfectly reliable 
because of differences among scorers, we also take 
up the related topic of interscorer reliability. 


Split-Half Reliability 


We obtain an estimate of split-half reliability by 
correlating the pairs of scores obtained from equiv- 
alent halves of a test administered only once to a 
representative sample of examinees. The logic of 
split-half reliability is straightforward: If scores on 


two half tests from a single test administration 
show a strong correlation, then scores on two whole 
tests from two separate test administrations (the 
traditional approach to evaluating reliability) also 
should reveal a strong correlation. 

Psychometricians typically view the split-half 
method as supplementary to the gold standard ap- 
proach, which is the test-retest method. For exam- 
ple, in the standardization of the WAIS-III, the 
reliability of most scales was established by the 
test-retest approach and the split-half approach. 
These two estimates of reliability are generally 
similar, although split-half approaches often yield 
higher estimates of reliability. 

One justification for the split-half approach is 
that logistical problems or excessive cost may ren- 
der it impractical to obtain a second set of test 
scores from the same examinees. In this case, a 
split-half estimate of reliability is the only thing 
available, and it is certainly better than no estimate 
at all. Another justification for the split-half ap- 
proach is that the test-retest method is potentially 
misleading in certain cases. For example, some 
ability tests are prone to large but inconsistent 
practice effects—such as when examinees learn 
concepts from feedback given as part of the stan- 
dardized testing procedure. When practice effects 
are large and variable, the rank order of scores 
from a second administration will at best sustain 
only a modest association to the rank order of 
scores from the first administration. For these kinds 
of instruments, test-retest reliability coefficients 
could be misleadingly low. Finally, test-retest ap- 
proaches also will yield misleadingly low estimates 
of reliability if the trait being measured is known 
to fluctuate rapidly (e.g., certain measures of 
mood). 

The major challenge with split-half reliability is 
dividing the test into two nearly equivalent halves. 
For most tests—especially those with the items 
ranked according to difficulty level—the first half is 
easier than the second half. We would not expect ex- 
aminees to obtain equivalent scores on these two 
portions, so this approach to splitting a test rarely is 
used. The most common method for obtaining split 
halves is to compare scores on the odd items versus
the even items of the test. This procedure works particularly
well when the items are arranged in approximate order of
difficulty.

In addition to calculating a Pearson r between 
scores on the two equivalent halves of the test, the 
computation of a coefficient of split-half reliability 
entails an additional step: adjusting the half-test re- 
liability using the Spearman-Brown formula. 


The Spearman-Brown Formula 


Notice that the split-half method gives us an esti- 
mate of reliability for an instrument half as long as 
the full test. Although there are some exceptions, a 
shorter test generally is less reliable than a longer 
test. This is especially true if, in comparison to the 
shorter test, the longer test embodies equivalent con- 
tent and similar item difficulty. Thus, the Pearson r 
between two halves of a test will usually underesti- 
mate the reliability of the full instrument. We need 
a method for deriving the reliability of the whole test 
based on the half-test correlation coefficient. 

The Spearman-Brown formula provides the
appropriate adjustment:


$$r_{SB} = \frac{2r_{hh}}{1 + r_{hh}}$$


In this formula, $r_{SB}$ is the estimated reliability of the
full test computed by the Spearman-Brown method,
while $r_{hh}$ is the half-test reliability. Table 3.8 shows
conceivable half-test correlations alongside the
corresponding Spearman-Brown reliability coefficients
for the whole test. For example, using the
Spearman-Brown formula, we could determine that
a half-test reliability of .70 is equivalent to an estimated
full-test reliability of .82.


TABLE 3.8 Comparison of Split-Half
Reliabilities and Corresponding Spearman-Brown
Reliabilities

Split-Half       Spearman-Brown
Reliability      Reliability

.5               .67
.6               .75
.7               .82
.8               .89
.9               .95
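
The odd-even split and the Spearman-Brown adjustment can be sketched in a few lines. This is an illustrative sketch only; the function names are assumptions introduced here, and the loop simply recomputes the values listed in Table 3.8.

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)

def odd_even_halves(item_scores):
    """Split one examinee's item scores into odd-item and even-item totals.

    Items are numbered from 1, so list index 0 is item 1 (an odd item).
    """
    odd_total = sum(item_scores[0::2])
    even_total = sum(item_scores[1::2])
    return odd_total, even_total

# Recompute the right-hand column of Table 3.8.
for r_half in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(r_half, round(spearman_brown(r_half), 2))
```

In practice the odd and even totals would be computed for every examinee, correlated, and the resulting half-test correlation passed to `spearman_brown`.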


Critique of the Split-Half Approach 


Although the split-half approach is widely used,
nonetheless it has been criticized for its lack of
precision:

Instead of giving a single coefficient for the test, 
the procedure gives different coefficients depend- 
ing on which items are grouped when the test is 
split into two parts. If one split may give a higher 
coefficient than another, one can have little faith in 
whatever result is obtained from a single split. 
(Cronbach, 1951) 


Why rely on a single split? Why not take a more 
typical value such as the mean of the split-half co- 
efficients resulting from all possible splittings of a 
test? Cronbach (1951) advocated just such an ap- 
proach when proposing a general formula for esti- 
mating the reliability of a psychological test. 


Coefficient Alpha 


As proposed by Cronbach (1951) and subsequently 
elaborated by others (Novick & Lewis, 1967; 
Kaiser & Michael, 1975), coefficient alpha may be 
thought of as the mean of all possible split-half co- 
efficients, corrected by the Spearman-Brown for- 
mula. The formula for coefficient alpha is 


$$r_\alpha = \left(\frac{N}{N-1}\right)\left(1 - \frac{\sum \sigma_j^2}{\sigma^2}\right)$$


where $r_\alpha$ is coefficient alpha, N is the number of
items, $\sigma_j^2$ is the variance of one item, $\sum \sigma_j^2$ is the
sum of variances of all items, and $\sigma^2$ is the variance
of the total test scores. As with all reliability estimates,
coefficient alpha can vary between 0.00 and
1.00.
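
The formula for coefficient alpha translates directly into code. The sketch below assumes a table of scores with one row per examinee and one column per item; the function names and the response data are hypothetical. Population variances (dividing by n) are used throughout, which is an assumption; because alpha depends only on a ratio of variances, dividing by n - 1 everywhere instead would give the same result.

```python
def variance(values):
    """Population variance (divide by n) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def coefficient_alpha(scores):
    """Coefficient alpha for a table of scores.

    `scores` holds one row per examinee and one column per item.
    """
    n_items = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of five examinees to a four-item attitude scale
# (1 = strongly disagree ... 4 = strongly agree).
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
]
print(round(coefficient_alpha(responses), 2))
```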

Coefficient alpha is an index of the internal 
consistency of the items, that is, their tendency to 


correlate positively with one another. Insofar as a 
test or scale with high internal consistency will also 
tend to show stability of scores in a test-retest ap- 
proach, coefficient alpha is therefore a useful esti- 
mate of reliability. 

Traditionally, coefficient alpha has been thought 
of as an index of unidimensionality, that is, the de- 
gree to which a test or scale measures a single fac- 
tor. Recent analyses by Cortina (1993) and Schmitt 
(1996) serve to dispel this misconception. Certainly 
coefficient alpha is an index of the interrelatedness 
of the individual items, but this is not synonymous 
with the unidimensionality of what the test or scale 
measures. In fact, it is possible for a scale to mea- 
sure two or more distinct factors and yet still pos- 
sess a very strong coefficient alpha. Schmitt (1996) 
gives the example of a 6-item test in which the first 
three items correlate .8 one with another, the last 
three items also correlate .8 one with another, 
whereas correlations across the two 3-item sets are 
only .3 (Table 3.9). Even though this is irrefutably 
a strong two-factor test, the value for coefficient 
alpha works out to be .86! For this kind of test, co- 
efficient alpha probably will overestimate test-retest 
reliability. This is why psychometricians look to
test-retest approaches as essential to the evaluation
of reliability. Certainly the split-half approach
in general, and coefficient alpha in particular, are
valuable approaches to reliability, but they cannot
replace the common sense of the test-retest approach:
When the same test is administered twice to
a representative sample of examinees, do they obtain
the same relative placement of scores?


TABLE 3.9 A Six-Item Test with Two Factors and
Strong Coefficient Alpha

Variable     1     2     3     4     5     6

1
2           .8
3           .8    .8
4           .3    .3    .3
5           .3    .3    .3    .8
6           .3    .3    .3    .8    .8

Note: Coefficient alpha = .86

Source: Reprinted with permission from Schmitt, N. (1996). Uses
and abuses of coefficient alpha. Psychological Assessment, 8,
350-353.
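
The value of .86 in Schmitt's example can be checked from the correlations alone. The sketch below assumes the six items are standardized (unit variance), so that the alpha formula can be applied to the correlation matrix; the function names are illustrative.

```python
def alpha_from_correlations(corr):
    """Coefficient alpha for standardized items, given their correlation matrix."""
    n_items = len(corr)
    # With unit item variances, the variance of the total score equals the
    # sum of every entry in the correlation matrix.
    total_variance = sum(sum(row) for row in corr)
    return (n_items / (n_items - 1)) * (1 - n_items / total_variance)

def two_factor_matrix(within=0.8, across=0.3, block=3):
    """Build a correlation matrix for two clusters of `block` items each."""
    size = 2 * block
    return [
        [1.0 if i == j else (within if i // block == j // block else across)
         for j in range(size)]
        for i in range(size)
    ]

corr = two_factor_matrix()
print(round(alpha_from_correlations(corr), 2))  # 0.86
```

Lowering the cross-cluster correlation (the `across` argument) shows how far the two factors can diverge before alpha drops noticeably.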


The Kuder-Richardson Estimate of Reliability 


Cronbach (1951) has shown that coefficient alpha is 
the general application of a more specific formula 
developed earlier by Kuder and Richardson (1937). 
Their formula is generally referred to as Kuder- 
Richardson formula 20 or, simply, KR-20, in ref- 
erence to the fact that it was the twentieth in a lengthy 
series of derivations. The KR-20 formula is relevant 
to the special case in which each test item is scored 
0 or 1 (e.g., wrong or right). The formula is


$$KR\text{-}20 = \left(\frac{N}{N-1}\right)\left(1 - \frac{\sum pq}{\sigma^2}\right)$$


where


N = the number of items on the test,
$\sigma^2$ = the variance of scores on the total test,
p = the proportion of examinees getting each
item correct,
q = the proportion of examinees getting each
item wrong.

Coefficient alpha extends the Kuder-Richardson
method to types of tests with items that are not
scored as 0 or 1. For example, coefficient alpha 
could be used with an attitude scale in which exam- 
inees indicate on each item whether they strongly 
agree, agree, disagree, or strongly disagree. 
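
For items scored 0 or 1, the KR-20 computation needs only the proportion passing each item and the variance of total scores. The following is a minimal sketch under the assumption that the total-score variance is a population variance (divided by the number of examinees); the function name and the answer data are hypothetical.

```python
def kr20(scores):
    """Kuder-Richardson formula 20 for items scored 0 or 1.

    `scores` holds one row per examinee and one column per item.
    """
    n_examinees = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_examinees
    # Population variance of total scores (divide by the number of examinees).
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_examinees  # proportion correct
        sum_pq += p * (1 - p)                            # q = 1 - p
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical right/wrong (1/0) answers of six examinees on five items.
answers = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(answers), 2))
```

Because the variance of a 0/1 item is pq, this gives the same value as coefficient alpha computed on the same data.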


Interscorer Reliability 


Some tests leave a great deal of judgment to the ex- 
aminer in the assignment of scores. Certainly, pro- 
jective tests fall into this category, as do tests of 
moral development and creativity. Insofar as the 
scorer can be a major factor in the reliability of 
these instruments, a report of interscorer reliability 
is imperative. Computing interscorer reliability is 
a very straightforward procedure. A sample of tests 
is independently scored by two or more examiners 
and scores for pairs of examiners are then corre- 
lated. Test manuals typically report the training and 
experience required of examiners and then list rep- 
resentative interscorer correlation coefficients. 

Interscorer reliability supplements other relia- 
bility estimates, but does not replace them. It would 
still be appropriate to assess the test-retest or other 
type of reliability in a subjectively scored test. We 
provide a quick summary of methods for estimat- 
ing reliability in Table 3.10. 


TABLE 3.10 Brief Synopsis of Methods for Estimating Reliability

                               No.     No.        Sources of Error
Method                         Forms   Sessions   Variance

Test-Retest                    1       2          Changes over time
Alternate-Forms (immediate)    2       1          Item sampling
Alternate-Forms (delayed)      2       2          Item sampling
                                                  Changes over time
Split-Half                     1       1          Item sampling
                                                  Nature of split
Coefficient Alpha              1       1          Item sampling
                                                  Test heterogeneity
Interscorer                    1       1          Scorer differences


Which Type of Reliability Is Appropriate?


As noted, even when a test has only a single 
form, there are still numerous methods available 
for assessing reliability: test-retest, split-half, co- 
efficient alpha, and interscorer methods. For tests 
that possess two forms, we can add a fifth method: 
alternate-forms reliability. Which method is best? 
When should we use one method but not another? 
To answer these questions, we need to know the 
nature and purpose of the individual test in 
question. 

For tests designed to be administered to in- 
dividuals more than once, it would be reasonable 
to expect that the test demonstrate reliability 
across time—in this case, test-retest reliability is 
appropriate. For tests that purport to possess 
factorial purity, coefficient alpha would be essen- 
tial. In contrast, factorially complex tests such 
as measures of general intelligence would not
fare well by measures of internal consistency. 
Thus, coefficient alpha is not an appropriate index 
of reliability for all tests, but applies only to mea- 
sures that are designed to assess a single factor. 
Split-half methods work well for instruments that 
have items carefully ordered according to diffi- 
culty level. Of course, interscorer reliability is 
appropriate for any test that involves subjectivity 
of scoring. 

It is common for test manuals to report multi- 
ple sources of information about reliability. For ex- 
ample, the WAIS-III Manual (Tulsky, Zhu, & 
Ledbetter, 1997) reports split-half reliabilities for 
most subtests and also provides test-retest coeffi- 
cients for all subtests and IQ scores. The manual 
also cites information akin to alternate-forms reli- 
ability—it reports the correlations between the 
WAIS-III and its predecessor, the WAIS-R. 

In order to analyze the error variance into its 
component parts, a number of reliability coeffi- 
cients will need to be computed. Although it is dif- 
ficult to arrive at precise data in the real world, on 
a theoretical basis we can partition the variability of 
scores into true and error components as depicted 
in Figure 3.12. 





[Figure 3.12: partition of total score variance]

True Variance: The Enduring and Real Amount of a Trait     80%
Error Variance: Factors Contributing to Imprecision of
Measurement                                                20%
    Content Sampling            10%
    Changes Over Time            8%
    Interscorer Differences      2%

Note: The results are similar to what might be found if alternative
forms of an individual intelligence test were administered to the
same person by different examiners.

FIGURE 3.12 Sources of Variance in a
Hypothetical Test


ITEM RESPONSE THEORY AND THE 
NEW RULES OF MEASUREMENT 


The classical test theory summarized previously 
dominated test development for most of the twen- 
tieth century. However, beginning slowly in the 
1960s and continuing to the present time, psycho- 
metricians have favored an alternative model of test 
theory and development known as item response 
theory (IRT), or latent trait theory (Embretson, 
1996; Hambleton, Swaminathan, & Rogers, 1991; 
Lord & Novick, 1968; Rasch, 1960). Item response 
theory has many technical advantages in compari- 
son to classical test theory, especially when tests 
are administered by computer—which is increas- 
ingly more common. Many theorists consider IRT 
to be one of the most important developments in 
psychological testing in recent times (e.g., Embret- 
son & Hershberger, 1999). 

When developing a test within the IRT frame- 
work, the psychometrician posits a single dimen- 
sion of skill or underlying trait on which all of the 
test items rely, to some extent, for their correct 


response. Each respondent is hypothesized to have 
a certain amount of the latent trait being measured, 
whether this is verbal proficiency, spatial memory, 
mathematical reasoning, or fine-motor skill. The 
position that each test item occupies on this di- 
mension is referred to as the item difficulty (usually 
denoted b). The position of each respondent on this 
dimension is referred to as his or her proficiency 
(usually denoted θ).

The appealing advantage of the IRT model is
that the probability of a respondent answering a
question correctly can be expressed as a precise
mathematical equation in terms of both b and θ. Although
it is beyond the scope of our presentation to
go into details, the formulas used in IRT test development
look something like this:


$$p(\theta) = \frac{1}{1 + e^{-(\theta - b)}}$$


where p(θ) is the probability of a respondent with
proficiency θ correctly responding to an item of difficulty
b. The symbol e in the equation refers to the
base for natural logarithms, which has a constant
value of approximately 2.71828. This particular formula
was developed by the Danish mathematician Georg Rasch
(1960); hence in his honor this IRT application is
also known as a Rasch Model.
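
The Rasch formula, p(θ) = 1 / (1 + e^-(θ - b)), is easy to evaluate directly. The sketch below is illustrative; the function name is an assumption, and the proficiency and difficulty values are arbitrary points on the latent trait scale.

```python
from math import exp

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - b)))

# When proficiency equals item difficulty, the chance of success is one half.
print(rasch_probability(0.0, 0.0))             # 0.5
# One unit above the item's difficulty.
print(round(rasch_probability(1.0, 0.0), 2))   # 0.73
# One unit below the item's difficulty.
print(round(rasch_probability(-1.0, 0.0), 2))  # 0.27
```

Only the difference between θ and b matters, so shifting both by the same amount leaves the probability unchanged.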

When fully explicated, IRT leads to what 
Embretson (1996) has called “the new rules of 
measurement.” By this she means that several con- 
clusions from classical testing theory do not hold 
true within the framework of IRT. For example, 
within classical testing theory, the standard error of 
measurement is assumed to be a constant that ap- 
plies to all examinee scores regardless of the abil- 
ity level of a particular respondent. However, 
within IRT the standard error of measurement be- 
comes substantially larger at both extremes of abil- 
ity. In other words, the IRT model concludes that 
test scores are more reliable for individuals of av- 
erage ability and increasingly less reliable for those 
with very high or very low ability. 
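
The way precision varies with ability can be illustrated with the Rasch model. The sketch below relies on a standard IRT result that is not derived in this chapter: each item contributes information p(1 - p) at a given proficiency, and the standard error of the proficiency estimate is the reciprocal of the square root of the summed information. The 21-item test and its difficulty values are hypothetical.

```python
from math import exp, sqrt

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def standard_error(theta, difficulties):
    """Standard error of the proficiency estimate at theta for a set of items."""
    information = sum(
        rasch_probability(theta, b) * (1 - rasch_probability(theta, b))
        for b in difficulties
    )
    return 1.0 / sqrt(information)

# Hypothetical 21-item test with difficulties spread evenly from -2 to +2.
difficulties = [-2 + 0.2 * i for i in range(21)]

for theta in (-3, 0, 3):
    print(theta, round(standard_error(theta, difficulties), 2))
```

The printed standard error is smallest at the middle of the scale and grows toward both extremes, which is the pattern described above.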

Another difference pertains to the relationship 
between test length and reliability. In classical test 
theory, it is almost an axiom that longer tests are
more reliable than shorter tests. For example, this
follows from the Spearman-Brown formula discussed
earlier in the chapter. However, when IRT
models are used, shorter tests can be more reliable 
than longer tests. This is especially true when there 
is a good match between the difficulty level of the 
specific items administered and the proficiency 
level of the examinee. A good fit between these two 
parameters allows for a precise (reliable) estimate 
of ability using a relatively smaller number of test 
items. 

In general, tests developed within an IRT model 
are better suited to computerized adaptive testing, 
in which a computer program is used not only to 
administer test items, but also to select them in a 
flexible manner based upon each examinee’s on- 
going responses to prior items. Computerized adap- 
tive testing is discussed in more detail in Topic 
15A, Computerized Assessment and the Future of 
Testing. 


SPECIAL CIRCUMSTANCES IN THE 
ESTIMATION OF RELIABILITY 


Traditional approaches to estimating reliability 
may be misleading or inappropriate for some ap- 
plications. Some of the more problematic situations 
involve unstable characteristics, speed tests, re- 
striction of range, and criterion-referenced tests. 


Unstable Characteristics 


Some characteristics are presumed to be ever 
changing in reaction to situational or physiological 
variables. Emotional reactivity as measured by 
electrodermal or galvanic skin response is a good 
example. Such a measure fluctuates quickly in re- 
action to loud noises, underlying thought pro- 
cesses, and stressful environmental events. Even 
just talking to another person can arouse a strong
electrodermal response. Because the true amount
of emotional reactivity changes so quickly, test
and retest must be nearly instantaneous in order to
provide an accurate index of reliability for unstable
characteristics such as an electrodermal measure of 
emotional reactivity. 


Speed and Power Tests 


A speed test typically contains items of uniform 
and generally simple levels of difficulty. If time 
permitted, most subjects should be able to complete 
most or all of the items on such a test. However, as 
the name suggests, a speeded test has a restrictive 
time limit that guarantees few subjects complete 
the entire test. Since the items attempted tend to 
be correct, an examinee’s score on a speeded test 
largely reflects speed of performance. 

Speed tests are often contrasted with power 
tests. A power test allows enough time for test tak- 
ers to attempt all items, but is constructed so that no 
test taker is able to obtain a perfect score. Most tests 
contain a mixture of speed and power components. 

The most important point to stress about the re- 
liability of speed tests is that the traditional split- 
half approach (comparing odd and even items) will 
yield a spuriously high reliability coefficient. Con- 
sider one test taker who completes 60 of 90 items 
on a speed test. Most likely, the odd-even approach 
would show 30 odd items correct and 30 even items 
correct. With similar data from other subjects, the 
correlation between scores on odd and even items 
necessarily would approach +1.00. The reliability 
of a speed test should be based on the test-retest 
method or split-half reliability from two separately
timed half tests. In the latter instance, the Spear- 
man-Brown correction is needed. 
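
The inflation of odd-even reliability on a speed test can be seen with a small numerical sketch. It assumes, as in the example above, that every item an examinee reaches is answered correctly; the numbers of items completed by the seven examinees are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical numbers of items completed on a 90-item speed test.
items_completed = [60, 45, 72, 51, 66, 38, 57]

# If items 1..k are all correct, the odd half holds ceil(k / 2) correct items
# and the even half holds floor(k / 2) correct items.
odd_scores = [(k + 1) // 2 for k in items_completed]
even_scores = [k // 2 for k in items_completed]

print(odd_scores[0], even_scores[0])  # 30 30
print(round(pearson_r(odd_scores, even_scores), 3))
```

The two half scores can differ by at most one point for any examinee, so the correlation is forced toward +1.00 regardless of how consistent the examinees' speed actually is.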


Restriction of Range 


Test-retest reliability will be spuriously low if it is 
based on a sample of homogeneous subjects for 
whom there is a restriction of range on the char- 
acteristic being measured. For example, it would be 
inappropriate to estimate the reliability of an intel- 
ligence test by administering it twice to a sample of 
college students. This point is illustrated by the hypothetical
but realistic scatterplot shown in Figure
3.13, where the reader can see a strong test-retest
correlation for the entire range of diverse subjects,
but a weak correlation for brighter subjects viewed
in isolation.


[Figure 3.13: scatterplot with axes First Test Score and Second Test Score]

FIGURE 3.13 Sampling a Restricted Range of Subjects
Causes Test-Retest Reliability to Be Spuriously Low
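
The effect of restriction of range can be reproduced with simulated scores. The sketch below rests on assumptions chosen purely for illustration: true scores with a mean of 100 and SD of 15, independent measurement error with an SD of 5 on each occasion, and a cutoff of 115 on the first test to define the "brighter" subgroup.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the sketch is repeatable

# Each obtained score is a true score plus independent error on each occasion.
true_scores = [random.gauss(100, 15) for _ in range(2000)]
first = [t + random.gauss(0, 5) for t in true_scores]
second = [t + random.gauss(0, 5) for t in true_scores]

full_r = pearson_r(first, second)

# Keep only examinees whose first score is 115 or higher (a restricted range).
restricted = [(a, b) for a, b in zip(first, second) if a >= 115]
restricted_r = pearson_r([a for a, _ in restricted], [b for _, b in restricted])

print(round(full_r, 2), round(restricted_r, 2))
```

The measurement error is identical in both analyses; only the spread of true scores differs, and that alone is enough to pull the restricted correlation well below the full-range value.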


Reliability of Criterion-Referenced Tests 


The reader will recall from the first topic of this 
chapter that criterion-referenced tests evaluate per- 
formance in terms of mastery rather than assessing 
a continuum of achievement. Test items are de- 
signed to identify specific skills that need remedi- 
ation; therefore, items tend to be of the “pass/fail” 
variety. 

The structure of criterion-referenced tests is 
such that the variability of scores among examinees 
is typically quite minimal. In fact, if test results are 
used for training purposes and everyone continues 
training until all test skills are mastered, variability 
in test scores becomes nonexistent. Under these 
conditions, traditional approaches to the assess- 
ment of reliability are simply inappropriate. 

With many criterion-referenced tests, results 
must be almost perfectly accurate to be useful. For 
example, any classification error is serious if the 
purpose of a test is to determine a subject’s ability 
to drive a manual transmission, or stick shift, auto- 


mobile. The key issue here is not whether test and 
retest scores are close to one another, but whether 
the classification (“can do/can’t do”) is the same in 
both instances. What we really want to know is the 
percentage of persons for whom the same decision 
is reached on both occasions—the closer to 100 
percent, the better. This is but one illustration of the 
need for specialized techniques in the evaluation of 
nonnormative tests. Berk (1984) and Feldt and 
Brennan (1989) discuss approaches to the reliabil- 
ity of criterion-referenced tests. 
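
The percentage of consistent decisions described above is straightforward to compute. This is a minimal sketch; the function name and the ten pairs of classifications are hypothetical.

```python
def decision_consistency(first_decisions, second_decisions):
    """Percentage of examinees given the same classification on both occasions."""
    matches = sum(a == b for a, b in zip(first_decisions, second_decisions))
    return 100.0 * matches / len(first_decisions)

# Hypothetical "can do" (True) / "can't do" (False) decisions for ten examinees.
test = [True, True, False, True, False, True, True, False, True, True]
retest = [True, True, False, True, True, True, True, False, True, False]

print(decision_consistency(test, retest))  # 80.0
```

Here two of the ten examinees are classified differently on retest, so the same decision is reached for 80 percent of the group.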


THE INTERPRETATION OF 
RELIABILITY COEFFICIENTS 


The reader should now be well versed in the dif- 
ferent approaches to reliability and should possess 
at least a conceptual idea of how reliability coeffi- 
cients are computed. In addition, we have discussed 
the distinctive testing conditions that dictate the use 
of one kind of reliability method as opposed to oth- 
ers. No doubt, the reader has noticed that we have 
yet to discuss one crucial question: What is an ac- 
ceptable level of reliability? 

Many authors suggest that reliability should be 
at least .90 if not .95 for decisions about individuals 
(e.g., Salvia & Ysseldyke, 1988; Nunnally & Bern- 
stein, 1994). However, there is really no hard and fast 
answer to this question. We offer the loose guidelines 
suggested by Guilford and Fruchter (1978): 


There has been some consensus that to be a very 
accurate measure of individual differences in some 
characteristic, the reliability should be above .90. 
The truth is, however, that many standard tests with 
reliabilities as low as .70 prove to be very useful. 
And tests with reliabilities lower than that can be 
useful in research. 


On a more practical level, acceptable standards 
of reliability hinge upon the amount of measure- 
ment error the user can tolerate in the proposed ap- 
plication of a test. Fortunately, reliability and 
measurement error are mutually interdependent 
concepts. Thus, if the test user can specify an ac- 
ceptable level of measurement error, then it is also 
possible to determine the minimum standards of reliability
required for that specific application of a
test. We pursue this topic further by introducing a
new concept: standard error of measurement.


RELIABILITY AND THE STANDARD
ERROR OF MEASUREMENT


To introduce the concept of standard error of measurement
we begin with a thought experiment. Suppose
we could administer thousands of equivalent
IQ tests to one individual. Suppose further that each 
test session was a fresh and new experience for our 
cooperative subject; in this hypothetical experi- 
ment, practice and boredom would have no effect 
on later test scores. Nonetheless, because of the 
kinds of random errors discussed in this chapter, the 
scores of our hapless subject would not be identical 
across test sessions. Our examinee might score a lit- 
tle worse on one test because he stayed up late the 
night before; the score on another test might be bet- 
ter because the items were idiosyncratically easy 
for him. Even though such error factors are random 
and unpredictable, it follows from the classical the- 
ory of measurement that the obtained scores would 
fall into a normal distribution with a precise mean 
and standard deviation. Let us say that the mean of 
the hypothetical IQ scores for our subject worked 
out to be 110, with a standard deviation of 2.5. 

In fact, the mean of this distribution of hypo- 
thetical scores would be the estimated true score for 
our examinee. Our best estimate, then, is that our 
subject has a true IQ of 110. Furthermore, the stan- 
dard deviation of the distribution of obtained scores 
would be the standard error of measurement 
(SEM). Note that while the true score on a test 
likely differs from one person to the next, the SEM 
is regarded as constant, an inherent property of the 
test. If we repeated this hypothetical experiment 
with another subject, the estimated true score 
would probably differ, but the SEM should work 
out to be a similar value.¹


1. This would hold true for subjects of similar age. The SEM
may differ from one age group to the next; see Wechsler (1997)
for an illustration with the WAIS-III.


As its name suggests, the SEM is an index
of measurement error that pertains to the test in 
question. In the hypothetical case in which SEM = 
0, there would be no measurement error at all. A 
subject’s obtained score would then also be his or 
her true score. However, this outcome is simply im- 
possible in real-world testing. Every test exhibits 
some degree of measurement error. The larger the 
SEM, the greater the typical measurement error. 
However, the accuracy or inaccuracy of any indi- 
vidual score is always a probabilistic matter and 
never a known quantity. 

As noted, the SEM can be thought of as the 
standard deviation of an examinee’s hypothetical 
obtained scores on a large number of equivalent 
tests, under the assumption that practice and 
boredom effects are ruled out. Like any standard 
deviation of a normal distribution, the SEM has 
well-known statistical uses. For example, 68 
percent of the obtained scores will fall within 
one SEM of the mean, just as 68 percent of the 
cases in a normal curve fall within one SD of 
the mean. 

The reader will recall from earlier in this chap- 
ter that about 95 percent of the cases in a normal 
distribution fall within two SDs of the mean. For 
this reason, if our examinee were to take one more 
IQ test, we could predict with 95 percent odds that 
the obtained score would be within two SEMs of 
the estimated true IQ of 110. Knowing that the 
SEM is 2.5, we would therefore predict that the obtained
IQ score would be 110 ± 5; that is, the obtained
score would very likely (95 percent odds) fall between
105 and 115.

Unfortunately, in the real world we do not 
have access to true scores and we most certainly 
cannot obtain multiple IQs from large numbers 
of equivalent tests; nor for that matter do we 
have direct knowledge of the SEM. All we typi- 
cally possess is a reliability coefficient (e.g., a 
test-retest correlation from normative studies) plus 
one obtained score from a single test administra- 
tion. How can we possibly use this information to 
determine the likely accuracy of our obtained 
score? 


Computing the Standard Error 
of Measurement 


We have noted several times in this chapter that re- 
liability and measurement error are intertwined 
concepts, with low reliability signifying high mea- 
surement error, and vice versa. It should not sur- 
prise the reader, then, that the SEM can be 
computed indirectly from the reliability coefficient. 
The formula is 


$$SEM = SD\sqrt{1 - r}$$


where SD is the standard deviation of the test 
scores and r is the reliability coefficient, both de- 
rived from a normative sample or other large and 
representative group of subjects. 

We can use WAIS-R Full Scale IQ to illustrate 
the computation of the SEM. The SD of WAIS-R 
scores is known to be about 15, and the reliability 
coefficient is .97 (Wechsler, 1981). The SEM for 
Full Scale IQ is therefore 


$$SEM = 15\sqrt{1 - .97}$$


which works out to be about 2.5. 
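
The computation can be checked with a few lines of code. This is a minimal sketch; the function name is illustrative. With SD = 15 and r = .97 the expression evaluates to roughly 2.6, close to the rounded figure of about 2.5 used in the text.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)

sem = standard_error_of_measurement(15, 0.97)
print(round(sem, 1))  # 2.6

# A less reliable test with the same SD has a larger SEM.
print(round(standard_error_of_measurement(15, 0.90), 1))  # 4.7
```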


The SEM and Individual Test Scores 


Let us consider carefully what the SEM tells us 
about individual test results, once again using 
WAIS-R IQs to illustrate a general point. What we 
would really like to know is the likely accuracy of 
IQ. Let us say we have an individual examinee who 
obtains a score of 90, and let us assume that the test 
was administered in competent fashion. Nonethe- 
less, is the obtained IQ score likely to be accurate? 

In order to answer this question, we need to 
rephrase it. In the jargon of classical test theory, 
questions of accuracy really involve comparisons 
between obtained scores and true scores. Specifi- 
cally, when we inquire whether an IQ score is ac- 
curate, we are really asking: How close is the 
obtained score to the true score? 

The answer to this question may seem perturbing
at first glance. It turns out that, in the individual
case, we can never know precisely how close
the obtained score is to the true score! The best we
can do is provide a probabilistic statement based on 
our knowledge that the hypothetical obtained 
scores for a single examinee would be normally 
distributed with a standard deviation equal to the 
SEM. Based on this premise, we know that the ob- 
tained score is accurate to within plus or minus 2 
SEMs in 95 percent of the cases. In other words, 
Full Scale IQ is 95 percent certain to be accurate
within ±5 IQ points. This range of plus or minus 5
IQ points corresponds to the 95 percent confidence 
interval for WAIS-R Full Scale IQ, because we can 
be 95 percent confident that the true score is con- 
tained within it. 

Testers would do well to report test scores in 
terms of a confidence interval, because this practice 
would help place scores in proper perspective 
(Sattler, 1988). An examinee who obtains an IQ of 
90 should be described as follows: “Mr. Doe ob- 
tained a Full Scale IQ of 90 which is accurate to 
±5 points with 95 percent confidence.” This wording
helps forewarn others that test scores always incorporate
some degree of measurement error.
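
The interval in such a report can be derived directly from the obtained score, the SD, and the reliability coefficient. The sketch below assumes the rule of thumb used in this section (plus or minus 2 SEMs for 95 percent confidence); the function name `confidence_interval_95` is hypothetical. Because it does not round the SEM to 2.5 first, the interval comes out slightly wider than ±5 points.

```python
import math


def confidence_interval_95(obtained: float, sd: float, reliability: float):
    """Return (low, high): the obtained score plus or minus 2 SEMs."""
    sem = sd * math.sqrt(1.0 - reliability)
    half_width = 2.0 * sem  # the text's "two SEMs" approximation to 1.96
    return obtained - half_width, obtained + half_width


# Mr. Doe's obtained Full Scale IQ of 90, with SD = 15 and r = .97
low, high = confidence_interval_95(obtained=90, sd=15, reliability=0.97)
print(f"{low:.1f} to {high:.1f}")  # 84.8 to 95.2
```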


The SEM and Differences between Scores 


Testers are often expected to surmise whether an 
examinee has scored significantly higher in one 
ability area than another. For example, it is usually 
germane to report whether an examinee is stronger 
at verbal or performance tasks or to say that no real 
difference exists in these two skill areas. The issue 
is not entirely academic. An examinee who has a 
relative superiority in performance intelligence
might be counseled to pursue practical, hands-on 
careers. In contrast, a strength in verbal intelligence 
might result in a recommendation to pursue acade- 
mic interests. How is an examiner to determine 
whether one test score is significantly better than 
another? 

Keep in mind that every test score incorporates 
measurement error. It is therefore possible for an 
examinee to obtain a verbal score higher than his or 
her performance score when the underlying true 
scores (if only we could know them) would reveal
no difference or even the opposite pattern!
(See Figure 3.14.) The important lesson here is that
when each of two obtained scores reflects measurement
error, the difference between these scores
is quite volatile and must not be overinterpreted.

FIGURE 3.14 Obtained Scores Reflect Measurement
Error and May Therefore Obscure the Relationship
between True Scores
Note: In this hypothetical case the obtained Verbal IQ is higher
than the obtained Performance IQ, whereas the underlying true
scores show the opposite pattern.

The standard error of the difference between 
two scores is a statistical measure that can help a 
test user determine whether a difference between 
scores is significant. The standard error of the dif- 
ference between two scores can be computed from 
the SEMs of the individual tests by the following 
formula: 


$SE_{diff} = \sqrt{(SEM_1)^2 + (SEM_2)^2}$


where $SE_{diff}$ is the standard error of the difference
and $SEM_1$ and $SEM_2$ are the respective standard errors
of measurement.

It is assumed that the two scores are on the
same scale or have been converted to the same
scale. That is, the tests must have the same overall
mean and standard deviation in the normative
sample. By substituting $SD\sqrt{1 - r_{11}}$ for $SEM_1$ and
$SD\sqrt{1 - r_{22}}$ for $SEM_2$, we arrive at


$SE_{diff} = SD\sqrt{2 - r_{11} - r_{22}}$


We return to our original question to illustrate
the computation and use of $SE_{diff}$. How is an examiner
to determine whether one test score is significantly
better than another? In particular, suppose an
examinee obtains Verbal IQ 112 and Performance 
IQ 105 on the WAIS-R. Is 7 IQ points a significant 
difference? 

We know from the WAIS-R Manual (Wechsler, 
1981) that Verbal and Performance IQ each have 
standard deviations of approximately 15; and their 
respective reliabilities are .97 and .93. The standard
error of the difference between these two scores can
be found from 


$SE_{diff} = 15\sqrt{2 - .97 - .93} = 4.74$


Recall from the discussion of normal distributions 
that 5 percent of the cases occur in the tails, beyond 
±1.96 standard deviations. Thus, differences that
are approximately twice as large as $SE_{diff}$ (that is,
1.96 × 4.74, or about 9.3) can be considered significant in the
sense that they will occur by chance only 5 percent 
of the time. We may conclude, then, that differ- 
ences of about 9 points or more between Verbal and 
Performance IQ likely reflect real differences in 
scores rather than chance contributions from errors 
of measurement. Thus, more likely than not, a dif- 
ference of merely 7 IQ points does not signify a 
bona fide, significant difference between verbal 
and performance intelligence. 
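
The same reasoning can be laid out step by step. The following is a sketch under the assumptions stated in the text (both scores on a scale with SD = 15, reliabilities of .97 and .93, and a two-tailed 5 percent criterion of 1.96); the function name `se_difference` is illustrative only.

```python
import math


def se_difference(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores on the same scale."""
    return sd * math.sqrt(2.0 - r1 - r2)


# WAIS-R Verbal and Performance IQ values cited in the text
se_diff = se_difference(sd=15, r1=0.97, r2=0.93)
critical_difference = 1.96 * se_diff  # smallest difference significant at the 5 percent level
observed_difference = 112 - 105       # Verbal IQ minus Performance IQ

print(round(se_diff, 2))                           # 4.74
print(round(critical_difference, 1))               # 9.3
print(observed_difference >= critical_difference)  # False
```

A 7-point difference falls short of the 9.3-point threshold, which matches the conclusion reached above.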


SUMMARY 


1. In psychological testing, reliability refers 
to the attribute of consistency of measurement. Few 
behavioral measurements are perfectly reliable— 
some degree of inconsistency is almost always 
present from one measurement to the next. Relia- 
bility should be viewed as a continuum. 


2. According to the classical theory of true 
and error scores, any test score reflects the influ- 
ence of two factors: factors that contribute to con- 
sistency, namely, the stable attributes that the 
examiner seeks to measure; and factors that con- 
tribute to inconsistency, which include subject, test, 
and situational variables. 


3. The fundamental equation of classical 
measurement theory is 


$X = T + e$


where X is the obtained score, T is the true score, 
and e represents errors of measurement. 


4. Measurement error can arise during item 
selection, test administration, and test scoring. Sys- 
tematic errors may also contribute to measurement 
error. An example of measurement error caused by
item selection: In the process of item selection, the
test developer may pick items that are not equally 
fair to all persons. 


5. Systematic errors of measurement may arise 
when, unknown to the test developer, a test consis- 
tently measures something other than the trait for 
which it was intended. For example, a test designed 
to measure social introversion might inadvertently 
measure anxiety in a consistent manner. 


6. The basic assumptions of classical mea- 
surement theory are (1) measurement errors are 
random, (2) the mean error of measurement is zero, 
(3) true scores and error scores are uncorrelated, 
and (4) errors on different tests are uncorrelated. It 
follows from these assumptions that the variance of 
obtained scores is simply the variance of true scores 
plus the variance of errors of measurement. 


7. Reliability expresses the relative influence 
of true and error scores on obtained test scores. The 
reliability coefficient is the ratio of true score vari- 
ance to the total variance of test scores (true score 
variance plus error score variance). The value of the 
reliability coefficient can vary between 0.0 and 1.0. 


8. The Pearson product-moment correlation 
coefficient can be used to gauge the consistency of 
psychological test scores. This form of reliability is 
referred to as test-retest reliability. Alternate-forms 
reliability is computed by correlating scores on two 
equivalent forms, administered in counterbalanced 
fashion to a large group of heterogeneous subjects. 


9. Internal consistency approaches to relia- 
bility include split-half reliability, in which scores 
on half tests are correlated with each other, and co- 
efficient alpha, which can be thought of as the mean 
of all possible split-half coefficients. 


10. For tests that require examiner judgment 
for assignment of scores, interscorer reliability 
is needed. Computing interscorer reliability is 
straightforward: A sample of tests is independently 
scored by two or more examiners and scores for 
pairs of examiners are then correlated. 


11. Item response theory (IRT) has been re- 
placing classical test theory as the preferred model 
for test development. IRT posits a single dimension 
of skill or underlying trait on which all of the test 
items rely, and hypothesizes that each respondent
has a certain amount of the latent trait being measured.
This allows for precise formulas linking the
probability of a correct response to item difficulty 
and the respondent’s level of the latent trait. 


12. Traditional approaches to estimating re- 
liability may be misleading or inappropriate for 
these applications: when the characteristic mea- 
sured is highly volatile or unstable; for speeded 
tests with simple item difficulty; when the subjects 
are highly homogeneous for the characteristic 
being measured. 


13. For many criterion-referenced tests, the re- 
sults must be almost perfectly reliable to be useful. 
Since criterion-referenced tests often have a “can 
do/can’t do” quality, repeatability of classification 
is one method of assessing the reliability of crite- 
rion-referenced tests. 


14. Reliability is inversely related to the stan- 
dard error of measurement (SEM). The SEM de- 
termines the confidence interval that surrounds 
every examinee’s score. For example, the 95 percent
confidence interval is ±2 SEMs from the examinee’s
obtained score.


As most every student of psychology knows,
the merit of a psychological test is determined
first by its reliability but then ultimately by
its validity. In the preceding chapter we pointed out 
that reliability can be appraised by many seemingly 
diverse methods ranging from the conceptually 
straightforward test-retest approach to the theoret- 
ically more complex methodologies of internal 
consistency. Yet, regardless of the method used, the 
assessment of reliability invariably boils down to a 
simple summary statistic, the reliability coefficient.
In this chapter, the more difficult and complex
issue of validity (what a test score means) is
investigated. The concept of validity is still evolving
and therefore stirs up a great deal more controversy
than its staid and established cousin,
reliability (AERA, APA, & NCME, 1999). In Topic 
4A, Basic Concepts of Validity, we introduce es- 
sential concepts of validity, including the standard 
tripartite division into content, criterion-related, 
and construct validity. We also discuss extravalid- 
ity concerns, which include side effects and un- 
intended consequences of testing. Extravalidity 
concerns have fostered a wider definition of test va- 
lidity that extends beyond the technical notions of 
content, criteria, and constructs. In Topic 4B, Test
Construction, we stress that validity must be built
into the test from the outset rather than being lim- 
ited to the final stages of test development. 

Put simply, the validity of a test is the extent 
to which it measures what it claims to measure. 
Psychometricians have long acknowledged that 
validity is the most fundamental and important 
characteristic of a test. After all, validity defines 
the meaning of test scores. Reliability is important, 
too, but only insofar as it constrains validity. To the 
extent that a test is unreliable, it cannot be valid. 
We can express this point from an alternative per- 
spective: Reliability is a necessary but not a suffi- 
cient precursor of validity. 

Test developers have a responsibility to dem- 
onstrate that new instruments fulfill the purposes 
for which they are designed. However, unlike test 
reliability, test validity is not a simple issue that is 
easily resolved on the basis of a few rudimentary 
studies. Test validation is a developmental process 
that begins with test construction and continues in- 
definitely: 


After a test is released for operational use, the in- 
terpretive meaning of its scores may continue to be 
sharpened, refined, and enriched through the grad- 
ual accumulation of clinical observations and 
through special research projects. . . . Test validity 
is a living thing; it is not dead and embalmed when 
the test is released. (Anastasi, 1986) 


Test validity hinges upon the accumulation of re- 
search findings (Case Exhibit 4.1). In the sections 
that follow, we examine the kinds of evidence 
sought in the validation of a psychological test. 


VALIDITY: A DEFINITION


We begin with a definition of validity, paraphrased 
from the influential Standards for Educational and 
Psychological Testing (AERA, APA, & NCME, 
1985, 1999): 


A test is valid to the extent that inferences 
made from it are appropriate, meaningful, and 
useful. 


Notice that a test score per se is meaningless until 
the examiner draws inferences from it based on the 
test manual or other research findings. For exam- 
ple, knowing that an examinee has obtained a 
slightly elevated score on the MMPI-2 Depression 
scale is not particularly helpful. This result be- 
comes valuable only when the examiner infers 
behavioral characteristics from it. Based on exist- 
ing research, the examiner might conclude, “The 
elevated Depression score suggests that the exam- 
inee has little energy and has a pessimistic outlook 
on life.” The MMPI-2 Depression scale possesses 
psychometric validity to the extent that such inferences
are appropriate, meaningful, and useful.

Unfortunately, it is seldom possible to summarize
the validity of a test in terms of a single, tidy
statistic. Determining whether inferences are ap- 
propriate, meaningful, and useful typically requires 
numerous studies of the relationships between test 
performance and other independently observed be- 
haviors. Validity reflects an evolutionary, research- 
based judgment of how adequately a test measures 
the attribute it was designed to measure. Conse- 
quently, the validity of tests is not easily captured 
by neat statistical summaries but is instead charac- 
terized on a continuum ranging from weak to ac- 
ceptable to strong. 

Traditionally, the different ways of accumulat- 
ing validity evidence have been grouped into three 
categories: 


• Content validity
• Criterion-related validity
• Construct validity


We will expand on this tripartite view of validity 
shortly, but first a few cautions. The use of these 
convenient labels does not imply that there are dis- 
tinct types of validity or that a specific validation 
procedure is best for one test use and not another: 


An ideal validation includes several types of evi- 
dence, which span all three of the traditional cate- 
gories. Other things being equal, more sources of 
evidence are better than fewer. However, the qual- 
ity of the evidence is of primary importance, and a 
single line of solid evidence is preferable to numer- 
ous lines of evidence of questionable quality. Pro- 
fessional judgment should guide the decisions 
regarding the forms of evidence that are most nec- 
essary and feasible in light of the intended uses of 
the test and any likely alternatives to testing. 
(AERA, APA, & NCME, 1985) 


We may summarize these points by stressing 
that validity is a unitary concept determined by the 
extent to which a test measures what it purports to 
measure. The inferences drawn from a valid test 
are appropriate, meaningful, and useful. In this
light, it should be apparent that virtually any em- 
pirical study that relates test scores to other find- 
ings is a potential source of validity information 
(Anastasi, 1986; Messick, 1995). 


CONTENT VALIDITY


Content validity is determined by the degree to 
which the questions, tasks, or items on a test are rep- 
resentative of the universe of behavior the test was 
designed to sample. In theory, content validity is re- 
ally nothing more than a sampling issue (Bausell, 
1986). The items of a test can be visualized as a 
sample drawn from a larger population of potential 
items that define what the researcher really wishes 
to measure. If the sample (specific items on the test) 
is representative of the population (all possible 
items), then the test possesses content validity. 

Content validity is a useful concept when a 
great deal is known about the variable that the re- 
searcher wishes to measure. With achievement 
tests in particular, it is often possible to specify the 
relevant universe of behaviors in advance. For ex- 
ample, when developing an achievement test of 
spelling, a researcher could identify nearly all pos- 
sible words that third graders should know. The 
content validity of a third-grade spelling achieve- 
ment test would be assured, in part, if words of 
varying difficulty level were randomly sampled 
from this preexisting list. 

However, test developers must take care to 
specify the relevant universe of responses as well. 
All too often, a multiple-choice format is taken for 
granted: 


If the constructor thinks about his aims with an 
open mind he will often decide that the task should 
call for a response constructed by the student— 
written open-end responses or, if inhibitions are to 
be minimized, oral responses. Nor are the direc- 
tions to the subject and the social setting of the test 
to be neglected in defining the task. (Cronbach, 
1971) 


In reference to spelling achievement, it cannot be 
assumed that a multiple-choice test will measure 
the same spelling skills as an oral test or a fre- 
quency count of misspellings in written composi- 
tions. Thus, when evaluating content validity, 
response specification is also an integral part of 
defining the relevant universe of behaviors. 

Content validity is more difficult to assure
when the test measures an ill-defined trait. How
could a test developer possibly hope to specify the
universe of potential items for a measure of anxiety?
In these cases in which the measured trait is
less tangible, no test developer in his or her right
mind would try to construct the literal universe of
potential test items. Instead, what usually passes
for content validity is the considered opinion of expert
judges. In effect, the test developer asserts that
“a panel of experts reviewed the domain specification
carefully and judged the following test questions
to possess content validity.” Figure 4.1
reproduces a sample judge’s item rating form for
determining the content validity of test questions.


Please read carefully through the domain specification for this test. Next, please indicate
how well you feel each item reflects the domain specification. Judge a test item solely on
the basis of match between its content and the content defined by the domain
specification. Please use the four-point rating scale shown below:

1 = not relevant   2 = somewhat relevant   3 = quite relevant   4 = very relevant

FIGURE 4.1 Sample Judge’s Item-Rating Form for Determining Content Validity
Source: Based on Martuza (1977), Hambleton (1984), Bausell (1986).


Quantification of Content Validity


Lawshe (1975), Martuza (1977), and others have
discussed statistical methods for determining the
overall content validity of a test from the judgments
of experts. These methods tend to be very
specialized and have not been widely accepted.
Nonetheless, their approaches can serve as a model
for a commonsense viewpoint on interrater agreement
as a basis for content validity.

When two expert judges evaluate individual
items of a test on the four-point scale proposed in
Figure 4.1, the ratings of each judge on each item
can be dichotomized into weak relevance (ratings
of 1 or 2) versus strong relevance (ratings of 3 or
4). For each item, then, the conjoint ratings of the
two judges can be entered into the two-by-two
agreement table depicted in Figure 4.2. For example,
if both judges believed an item was quite relevant
(strong relevance), it would be placed in cell
D. If the first judge believed an item was very
relevant (strong relevance) but the second judge
deemed it to be only slightly relevant (weak relevance),
the item would be placed in cell B.

Notice that cell D is the only cell that reflects 
valid agreement between judges. The other cells 
involve disagreement (cells B and C) or agree- 
ment that an item doesn’t belong on the test (cell 
A). We have reproduced hypothetical results for 
a 100-item test in Figure 4.3. A coefficient of con- 
tent validity can be derived from the following 
formula: 


$\text{content validity} = \dfrac{D}{A + B + C + D}$





                                        EXPERT JUDGE #1
                                Weak Relevance        Strong Relevance
                                (item rated 1 or 2)   (item rated 3 or 4)
EXPERT     Weak Relevance
JUDGE #2   (item rated 1 or 2)          A                     B
           Strong Relevance
           (item rated 3 or 4)          C                     D

FIGURE 4.2 Interrater Agreement Model for
Content Validity


                                        EXPERT JUDGE #1
                                Weak Relevance        Strong Relevance
                                (item rated 1 or 2)   (item rated 3 or 4)
EXPERT     Weak Relevance
JUDGE #2   (item rated 1 or 2)       4 items               5 items
           Strong Relevance
           (item rated 3 or 4)       4 items               87 items

FIGURE 4.3 Hypothetical Example of Agreement
Model of Content Validity for a 100-Item Test


For example, on our 100-item test both judges 
concurred that 87 items were strongly relevant 
(cell D), so the coefficient of content validity 
would be 87/(4 + 4 + 5 + 87) or .87. If more than 
two judges are used, this computational procedure 
could be completed with all possible pair-wise 
combinations of judges, and the average coeffi- 
cient reported. An important note: A coefficient of 
content validity is just one piece of evidence in the 
evaluation of a test. Such a coefficient does not by 
itself establish the validity of a test. 
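
The tallying procedure can be sketched in code. The example below assumes two lists of ratings on the four-point scale, one per judge, in the same item order; the function name `content_validity_coefficient` and the rating lists are hypothetical, constructed only to reproduce the cell counts of the 100-item example (A = 4, B = 5, C = 4, D = 87).

```python
def content_validity_coefficient(judge1_ratings, judge2_ratings):
    """Return (coefficient, cells) where coefficient = D / (A + B + C + D).

    Ratings of 1 or 2 count as weak relevance; ratings of 3 or 4 as strong.
    """
    if len(judge1_ratings) != len(judge2_ratings):
        raise ValueError("both judges must rate the same items")
    cells = {"A": 0, "B": 0, "C": 0, "D": 0}
    for r1, r2 in zip(judge1_ratings, judge2_ratings):
        strong1, strong2 = r1 >= 3, r2 >= 3
        if strong1 and strong2:
            cells["D"] += 1  # both judges: strong relevance
        elif strong1:
            cells["B"] += 1  # judge 1 strong, judge 2 weak
        elif strong2:
            cells["C"] += 1  # judge 1 weak, judge 2 strong
        else:
            cells["A"] += 1  # both judges: weak relevance
    return cells["D"] / sum(cells.values()), cells


# Hypothetical ratings arranged to match the 100-item example
judge1 = [1] * 4 + [4] * 5 + [2] * 4 + [3] * 87
judge2 = [2] * 4 + [1] * 5 + [3] * 4 + [4] * 87

coefficient, cells = content_validity_coefficient(judge1, judge2)
print(cells)        # {'A': 4, 'B': 5, 'C': 4, 'D': 87}
print(coefficient)  # 0.87
```

With more than two judges, the same function could be applied to each pair of judges and the resulting coefficients averaged, as described above.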

The commonsense approach to content validity 
advocated here serves well as a flagging mechanism
to help cull out existing items that are deemed
inappropriate by expert raters. However, it cannot 
identify nonexistent items that should be added to 
a test to help make the pool of questions more rep- 
resentative of the intended domain. A test could 
possess a robust coefficient of content validity and 
still fall short in subtle ways. Quantification of 
content validity is no substitute for careful selec- 
tion of items. 


Face Validity 


We digress briefly here to mention face validity, 
which is not really a form of validity at all. Nonetheless,
the concept is encountered in testing and
therefore needs brief explanation. A test has face
validity if it looks valid to test users, examiners, 
and especially the examinees. Face validity is 
really a matter of social acceptability, and not a 
technical form of validity in the same category 
as content, criterion-related, or construct validity 
(Nevo, 1985). From a public relations standpoint, 
it is crucial that tests possess face validity—other- 
wise those who take the tests may be dissatisfied 
and doubt the value of psychological testing. How- 
ever, face validity should not be confused with 
objective validity, which is determined by the rela- 
tionship of test scores to other sources of informa- 
tion. In fact, a test could possess extremely strong 
face validity—the items might look highly relevant 
to what is presumably measured by the instru- 
ment—yet produce totally meaningless scores 
with no predictive utility whatever. 


CRITERION-RELATED VALIDITY


Criterion-related validity is demonstrated when 
a test is shown to be effective in estimating an ex- 
aminee’s performance on some outcome measure. 
In this context, the variable of primary interest is 
the outcome measure, called a criterion. The test 
score is useful only insofar as it provides a basis for 
accurate prediction of the criterion. For example, a 
college entrance exam that is reasonably accurate 
in predicting the subsequent grade point average of 
examinees would possess criterion-related validity. 

Two different approaches to validity evidence 
are subsumed under the heading of criterion-related 
validity. In concurrent validity, the criterion mea- 
sures are obtained at approximately the same time as 
the test scores. For example, the current psychiatric 
diagnosis of patients would be an appropriate crite- 
rion measure to provide validation evidence for a 
paper-and-pencil psychodiagnostic test. In predic- 
tive validity, the criterion measures are obtained in 
the future, usually months or years after the test 
scores are obtained, as with the college grades pre- 
dicted from an entrance exam. Each of these two ap- 
proaches is best suited to different testing situations, 
discussed in the following sections. However, before 
we review the nature of concurrent and predictive
validity, let us examine a more fundamental question:
What are the characteristics of a good criterion?


Characteristics of a Good Criterion 


As noted, a criterion is any outcome measure 
against which a test is validated. In practical terms, 
a criterion can be most anything. Some examples 
will help to illustrate the diversity of potential cri- 
teria. A simulator-based driver skill test might be 
validated against a criterion of “number of traffic 
citations received in the last twelve months.” A 
scale measuring social readjustment might be val- 
idated against a criterion of “number of days spent 
in a psychiatric hospital in the last three years.” A 
test of sales potential might be validated against a 
criterion of “dollar amount of goods sold in the 
preceding year.” The choice of criteria is circum- 
scribed, in part, by the ingenuity of the test devel- 
oper. However, criteria must be more than just 
imaginative; they must also be reliable, appropriate,
and free of contamination from the test itself.

The criterion must itself be reliable if it is to be a 
useful index of what the test measures. If you recall 
the meaning of reliability—consistency of scores— 
the need for a reliable criterion measure is intuitively 
obvious. After all, unreliable means unpredictable. 
An unreliable criterion will be inherently unpre- 
dictable, regardless of the merits of the test. 

Consider the case in which scores on a college 
entrance exam (the test) are used to predict subse- 
quent grade point average (the criterion). The va- 
lidity of the entrance exam could be studied by 
computing the correlation ($r_{xy}$) between entrance
exam scores and grade point averages for a repre- 
sentative sample of students. For purposes of a va- 
lidity study, it would be ideal if the students were 
granted open or unscreened enrollment so as to 
prevent a restriction of range on the criterion variable.
In any case, the resulting correlation coefficient
is called a validity coefficient.¹


1. We have purposefully refrained from referring to such a sta- 
tistic as the validity coefficient. Remember that validity is a uni- 
tary concept determined by multiple sources of information that 
may include the correlation between test and criterion.

The theoretical upper limit of the validity coefficient
is constrained by the reliability of both the
test and the criterion:


$r_{xy} \le \sqrt{(r_{xx})(r_{yy})}$


The validity coefficient is always less than or equal 
to the square root of the test reliability multiplied 
by the criterion reliability. In other words, to the 
extent that the reliability of either the test or the cri- 
terion (or both) is low, the validity coefficient is 
also diminished. Returning to our example of an 
entrance exam used to predict college grade point 
average, we must conclude that the validity coeffi- 
cient for such a test will always fall far short of 
+1.00, owing in part to the unreliability of college 
grades and also in part to the unreliability of the 
test itself. 
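
To make the ceiling concrete, the sketch below computes a validity coefficient as a Pearson correlation and compares it with the theoretical upper limit. Everything numeric here is an assumption for illustration: the exam scores, grade point averages, and the reliabilities of .90 (test) and .60 (criterion) are hypothetical, and the function names are our own.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def validity_ceiling(test_reliability, criterion_reliability):
    """Theoretical upper limit of r_xy: sqrt(r_xx * r_yy)."""
    return math.sqrt(test_reliability * criterion_reliability)


# Hypothetical entrance exam scores and later grade point averages
exam_scores = [52, 61, 47, 70, 58, 66, 49, 73]
gpas = [2.6, 3.1, 2.9, 3.4, 2.7, 3.0, 2.4, 3.6]

r_xy = pearson_r(exam_scores, gpas)
ceiling = validity_ceiling(test_reliability=0.90, criterion_reliability=0.60)
print(f"validity coefficient = {r_xy:.2f}, theoretical ceiling = {ceiling:.2f}")
```

With these assumed reliabilities the ceiling is about .73, so even a test that predicted grades as well as their reliability permits could not approach +1.00.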

A criterion measure must also be appropriate 
for the test under investigation. The Standards for 
Educational and Psychological Testing source- 
book (AERA, APA, & NCME, 1985) incorporates 
this important point as a separate standard: 


All criterion measures should be described accu- 
rately, and the rationale for choosing them as rele- 
vant criteria should be made explicit. 


For example, in the case of interest tests, it is some- 
times unclear whether the criterion measure should 
indicate satisfaction, success, or continuance in the 
activities under question. The choice between 
these subtle variants in the criterion must be made 
carefully, based on an analysis of what the interest 
test purports to measure. 

A criterion must also be free of contamination 
from the test itself. Lehman (1978) has illustrated 
this point in a criterion-related validity study of 
a life change measure. The Schedule of Recent 
Events, or the SRE (Holmes & Rahe, 1967) is 
a widely used instrument that provides a quantita- 
tive index of the accumulation of stressful life 
events (e.g., divorce, job promotion, traffic tick- 
ets). Scores on the SRE correlate modestly with 
such criterion measures as physical illness and 
psychological disturbance. However, many seem- 
ingly appropriate criterion measures incorporate 
items that are similar or identical to SRE items. For
example, screening tests of psychiatric symptoms
often check for changes in eating, sleeping, or so- 
cial activities. Unfortunately, the SRE incorporates 
questions that check for the following: 


Change in eating habits 
Change in sleeping habits 
Change in social activities 


If the screening test contains the same items as the 
SRE, then the correlation between these two measures
will be artificially inflated. This potential
source of error in test validation is referred to as 
criterion contamination, since the criterion is 
“contaminated” by its artificial commonality with 
the test. 

Criterion contamination is also possible when 
the criterion consists of ratings from experts. If the 
experts also possess knowledge of the examinees’ 
test scores, this information may (consciously or 
unconsciously) influence their ratings. When vali- 
dating a test against a criterion of expert ratings, 
the test scores must be held in strictest confidence 
until the ratings have been collected. 

Now that the reader knows the general charac- 
teristics of a good criterion, we will review the ap- 
plication of this knowledge in the analysis of 
concurrent and predictive validity. 


Concurrent Validity 


In a concurrent validation study, test scores and cri- 
terion information are obtained simultaneously. 
Concurrent evidence of test validity is usually desirable
for achievement tests, tests used for licensing
or certification, and diagnostic clinical tests.
An evaluation of concurrent validity indicates the 
extent to which test scores accurately estimate an 
individual’s present position on the relevant crite- 
rion. For example, an arithmetic achievement test 
would possess concurrent validity if its scores 
could be used to predict, with reasonable accuracy, 
the current standing of students in a mathematics 
course. A personality inventory would possess 
concurrent validity if diagnostic classifications derived
from it roughly matched the opinions of psychiatrists
or clinical psychologists.


A test with demonstrated concurrent validity 
provides a shortcut for obtaining information that 
might otherwise require the extended investment 
of professional time. For example, the case assign- 
ment procedure in a mental health clinic can be ex- 
pedited if a test with demonstrated concurrent 
validity is used for initial screening decisions. In 
this manner, severely disturbed patients requiring 
immediate clinical workup and intensive treatment 
can be quickly identified by a paper-and-pencil test.
Of course, tests are not intended to replace mental 
health specialists, but they can save time in the ini- 
tial phases of diagnosis. 

Correlations between a new test and existing 
tests are often cited as evidence of concurrent va- 
lidity. This has a catch-22 quality to it—old tests 
validating a new test—but is nonetheless appropri- 
ate if two conditions are met. First, the criterion 
(existing) tests must have been validated through 
correlations with appropriate nontest behavioral 
data. In other words, the network of interlocking 
relationships must touch ground with real-world 
behavior at some point. Second, the instrument 
being validated must measure the same construct 
as the criterion tests. Thus, it is entirely appropriate
that developers of a new intelligence test report
correlations between it and established mainstays 
such as the Stanford-Binet and Wechsler scales. 


Predictive Validity 


In a predictive validation study, test scores are used 
to estimate outcome measures obtained at a later 
date. Predictive validity is particularly relevant for 
entrance examinations and employment tests. Such 
tests share a common function—determining who 
is likely to succeed at a future endeavor. A relevant 
criterion for a college entrance exam would be 
first-year-student grade point average, while an 
employment test might be validated against super- 
visor ratings after six months on the job. In the 
ideal situation, such tests are validated during peri- 
ods of open enrollment (or open hiring) so that a 
full range of results is possible on the outcome 
measures. In this manner, future use of the test as a 
selection device for excluding low-scoring applicants
will rest on a solid foundation of validational
data. 

When tests are used for purposes of prediction, 
it is necessary to develop a regression equation. A 
regression equation describes the best-fitting 
straight line for estimating the criterion from the 
test. We will not discuss the statistical approach to 
fitting the straight line, except to mention that it 
minimizes the sum of the squared deviations from 
the line (Ghiselli, Campbell, & Zedeck, 1981). For 
current purposes, it is more important to understand 
the nature and function of regression equations. 

Ghiselli and associates (1981) provide a simple 
example of regression in the service of prediction, 
summarized here. Suppose we are trying to predict 
success on a job Y (evaluated by the supervisor on 
a 7-point scale ranging from poor to excellent per- 
formance) from scores on a preemployment test X 
(with scores that range from a low of 0 to a high of 
100). The regression equation 


Y=.07X + .2 


might describe the best-fitting straight line and 
therefore produce the most accurate predictions. 
For an individual who scored 55 on the test, the 
predicted performance level would be 4.05; that is, 
.07(55) + .2. A test score of 33 yields a predicted 
performance level of 2.51; that is, .07(33) + .2. Ad- 
ditional predictions are made likewise. 
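
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `fit_line` and `predict_performance` are hypothetical, the three-point validation sample is invented so that it lies exactly on the line, and only the coefficients .07 and .2 come from the example.

```python
def fit_line(test_scores, criterion_scores):
    """Least-squares slope and intercept for predicting the criterion from the test."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_performance(test_score, slope=0.07, intercept=0.2):
    """Predicted supervisor rating (Y) from a preemployment test score (X)."""
    return slope * test_score + intercept


# Hypothetical validation sample that lies exactly on Y = .07X + .2
slope, intercept = fit_line([0, 50, 100], [0.2, 3.7, 7.2])
print(round(slope, 2), round(intercept, 2))

# The two predictions worked out in the text
for score in (55, 33):
    print(score, round(predict_performance(score), 2))
```

With real validation data the points scatter around the line, and `fit_line` returns the slope and intercept that minimize the sum of squared deviations.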


Validity Coefficient and the 
Standard Error of the Estimate 


The relationship between test scores and criterion 
measures can be expressed in several different 
ways. Perhaps the most popular approach is to 
compute the correlation between test and criterion
($r_{xy}$). In this context, the resulting correlation is
known as a validity coefficient. The higher the validity
coefficient $r_{xy}$, the more accurate is the test in
predicting the criterion. In the hypothetical case
where $r_{xy}$ is 1.00, the test would possess perfect validity
and allow for flawless prediction. Of course,
no such test exists, and validity coefficients are 
more commonly in the low- to midrange of corre- 
lations and rarely exceed .80. But how high should 
a validity coefficient be? There is no general answer
to this question. However, we can approach
the question indirectly by investigating the rela- 
tionship between the validity coefficient and the 
corresponding error of estimate. 
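
For readers who want to see the computation, a validity coefficient is simply a Pearson correlation between test scores and criterion scores. The sketch below assumes paired lists of scores; the function name `validity_coefficient` and the sample data are hypothetical and serve only to illustrate the calculation.

```python
import math


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores (X) and criterion scores (Y)."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion_scores)
    return sxy / math.sqrt(sxx * syy)


# Invented example: eight applicants, test scores and later supervisor ratings
test = [82, 75, 68, 60, 55, 47, 40, 33]
ratings = [6, 5, 6, 4, 4, 3, 4, 2]
print(round(validity_coefficient(test, ratings), 2))
```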

The standard error of estimate ($SE_{est}$) is the
margin of error to be expected in the predicted criterion
score. The error of estimate is derived from
the following formula:


$$SE_{est} = SD_y \sqrt{1 - r_{xy}^2}$$


In this formula, $r_{xy}^2$ is the square of the validity coefficient
and $SD_y$ is the standard deviation of the
criterion scores. Perhaps the reader has noticed the
similarities between this index and the standard
error of measurement (SEM). In fact, both indices
help gauge margins of error. The SEM indicates
the margin of measurement error caused by unreliability
of the test, whereas $SE_{est}$ indicates the margin
of prediction error caused by the imperfect
validity of the test.

The $SE_{est}$ helps answer the fundamental question:
“How accurately can criterion performance
be predicted from test scores?” (AERA, APA, &
NCME, 1985). Consider the common practice of
attempting to predict college grade point average
from high school scores on a scholastic aptitude
test. For a specific aptitude test, suppose we determine
that the $SE_{est}$ for predicted grade point average
is .2 (on the usual 0.0 to 4.0 grade point scale).
What does this mean for the examinee whose college
grade point is predicted to be 3.1? As is the
case with all standard deviations, the standard error
of the estimate can be used to bracket predicted
outcomes in a probabilistic sense. Assuming that
the frequency distribution of grades is normal, we
know that the chances are about 68 in 100 that the
examinee’s actual grade point will fall between
2.9 and 3.3 (plus or minus one $SE_{est}$). In like manner,
we know that the chances are about 95 in 100
that the examinee’s actual grade point will fall
between 2.7 and 3.5 (plus or minus two $SE_{est}$).

What is an acceptable standard of predictive
accuracy? There is no simple answer to this question.
As the reader will discern from the discussion
that follows, standards of predictive accuracy are,
in part, value judgments. To explain why this is so,
we need to introduce the basic elements of deci- 
sion theory (Taylor & Russell, 1939; Cronbach & 
Gleser, 1965). 
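
Before turning to decision theory, the standard error of estimate and the brackets drawn around a predicted grade point can be sketched as follows. The function names are hypothetical; the values 3.1 and .2 come from the grade point example, whereas the criterion standard deviation of .5 and validity of .60 in the first call are invented for illustration.

```python
import math


def standard_error_of_estimate(sd_criterion, validity):
    """SE_est = SD_y * sqrt(1 - r_xy**2)."""
    return sd_criterion * math.sqrt(1.0 - validity ** 2)


def prediction_band(predicted, se_est, n_se=1):
    """Interval of plus or minus n_se standard errors around a predicted score."""
    return predicted - n_se * se_est, predicted + n_se * se_est


# Invented inputs: criterion SD of .5 and a validity coefficient of .60
print(round(standard_error_of_estimate(0.5, 0.60), 2))

# Grade point example from the text: predicted 3.1 with SE_est = .2
for n_se in (1, 2):
    low, high = prediction_band(3.1, 0.2, n_se)
    print(n_se, round(low, 1), round(high, 1))
```

Note that a validity of 1.00 drives the error of estimate to zero, while a validity of .00 leaves it equal to the standard deviation of the criterion, so that the test adds nothing to prediction.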


Decision Theory Applied 
to Psychological Tests 


Proponents of decision theory stress that the pur- 
pose of psychological testing is not measurement 
per se but measurement in the service of decision 
making. The personnel manager wishes to know 
whom to hire; the admissions officer must choose
whom to admit; the parole board desires to know
which felons are good risks for early release; and 
the psychiatrist needs to determine which patients 
require hospitalization. 

The link between testing and decision making 
is nowhere more obvious than in the context of pre- 
dictive validation studies. Many of these studies 
use test results to determine who will likely suc- 
ceed or fail on the criterion task so that, in the fu- 
ture, examinees with poor scores on the predictor 
test can be screened from admission, employment, 
or other privilege. This is the rationale by which ad- 
missions officers or employers require applicants 
to obtain a certain minimum score on an appropri- 
ate entrance or employment exam—previous stud- 
ies of predictive validity can be cited to show that 
candidates scoring below a certain cutoff face steep 
odds in their educational or employment pursuits. 

Psychological tests frequently play a major
role in this kind of institutional decision making.
In a typical institutional decision, a committee—or 
sometimes a single person—makes a large number 
of comparable decisions based on a cutoff score on 
one or more selection tests. In order to present the 
key concepts of decision theory, let us oversim- 
plify somewhat and assume that only a single test 
is involved. 

Even though most tests produce a range of 
scores along a continuum, it is usually possible to 
identify a cutoff or pass/fail score that divides the 
sample into those predicted to succeed versus those 
predicted to fail on the criterion of interest. Let 
us assume that persons predicted to succeed are
also selected for hiring or admission. In this case,
the proportion of persons in the “predicted-to- 
succeed” group is referred to as the selection ratio. 
The selection ratio can vary from 0 to 1.0, depend- 
ing on the proportion of persons who are consid- 
ered good bets to succeed on the criterion measure. 

If the results of a selection test allow for the 
simple dichotomy of “predicted to succeed” versus 
“predicted to fail,” then the subsequent outcome on 
the criterion measure likewise can be split into two 
categories, namely, “did succeed” and “did fail.” 
From this perspective, every study of predictive va- 
lidity produces a two-by-two matrix, as portrayed 
in Figure 4.4. 

Certain combinations of predicted and actual 
outcomes are more likely than others. If a test has 
good predictive validity, then most persons pre- 
dicted to succeed will succeed; and most persons 
predicted to fail will fail. These are examples of 
correct predictions and serve to bolster the validity 
of a selection instrument. Outcomes in these two 
cells are referred to as hits because the test has 
made a correct prediction. 

FIGURE 4.4 Possible Outcomes When a Selection Test
Is Used to Predict Performance on a Criterion Measure

But no selection test is a perfect predictor, so
two other types of outcomes are also possible.
Some persons predicted to succeed will, in fact,
fail. These cases are referred to as false positives.
And some persons predicted to fail would, if given
the chance, succeed. These cases are referred to as
false negatives. False positives and false negatives
are collectively known as misses, because in both
cases the test has made an inaccurate prediction.
Finally, the hit rate is the proportion of cases in
which the test accurately predicts success or failure,
that is, hit rate = (hits)/(hits + misses).
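
The two-by-two classification just described can be tallied as in the sketch below. It assumes a validation sample in which everyone was admitted, so that false negatives can actually be observed; the function name, the cutoff of 55, and the eight cases are all hypothetical.

```python
def classify_outcomes(test_scores, succeeded, cutoff):
    """Tally hits, false positives, and false negatives for a given cutoff score."""
    hits = false_positives = false_negatives = selected = 0
    for score, did_succeed in zip(test_scores, succeeded):
        predicted_success = score >= cutoff
        if predicted_success:
            selected += 1
        if predicted_success == did_succeed:
            hits += 1                 # prediction matched the outcome
        elif predicted_success:
            false_positives += 1      # predicted to succeed, but failed
        else:
            false_negatives += 1      # predicted to fail, but succeeded
    misses = false_positives + false_negatives
    return {
        "hits": hits,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "hit_rate": hits / (hits + misses),
        "selection_ratio": selected / len(test_scores),
    }


# Invented validation sample of eight examinees
scores = [80, 72, 65, 58, 51, 44, 37, 30]
outcomes = [True, True, False, True, True, False, False, False]
print(classify_outcomes(scores, outcomes, cutoff=55))
```

Raising or lowering the cutoff changes the selection ratio and shifts errors between the false-positive and false-negative cells.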

False positives and false negatives are unavoid- 
able in the real-world use of selection tests. The 
only way to eliminate such selection errors would 
be to develop a perfect test, an instrument which 
has a validity coefficient of +1.00, signifying a perfect
correlation with the criterion measure. A perfect
test is theoretically possible, but none has yet
been observed on this planet. Nonetheless, it is 
still important to develop selection tests with very 
high predictive validity, so as to minimize decision 
errors. 

Proponents of decision theory make two funda- 
mental assumptions about the use of selection tests: 


1. The value of various outcomes to the institution 
can be expressed in terms of a common utility 
scale. One such scale—but by no means the 
only one—is profit and loss. For example, when 
using an interest inventory to select salesper- 
sons, a corporation can anticipate profit from 
applicants correctly identified as successful, but 
will lose money when, inevitably, some of those 
selected do not sell enough even to support their 
own salary (false positives). The cost of the selection
procedure must also be factored into the
utility scale.

2. In institutional selection decisions, the most 
generally useful strategy is one that maximizes 
the average gain on the utility scale (or mini- 
mizes average loss) over many similar deci- 
sions. For example, which selection ratio 
produces the largest average gain on the utility 
scale? Maximization is thus the fundamental 
decision principle. 
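
These two assumptions can be illustrated with a toy calculation. The sketch below is not a method taken from the decision theory literature; it simply assumes a single utility scale in arbitrary units, with a hypothetical gain for each selected applicant who succeeds, a hypothetical loss for each false positive, and a fixed testing cost per examinee, and then picks the cutoff with the largest average utility. All names and numbers are invented.

```python
def average_utility(scores, succeeded, cutoff,
                    gain_success, loss_failure, cost_per_test):
    """Average utility per examinee when everyone at or above the cutoff is selected."""
    total = 0.0
    for score, did_succeed in zip(scores, succeeded):
        total -= cost_per_test            # every examinee is tested
        if score >= cutoff:               # selected applicants only
            total += gain_success if did_succeed else -loss_failure
    return total / len(scores)


def best_cutoff(scores, succeeded, cutoffs, **payoffs):
    """Cutoff that maximizes average utility (the maximization principle)."""
    return max(cutoffs,
               key=lambda c: average_utility(scores, succeeded, c, **payoffs))


# Invented validation data and payoffs
scores = [80, 72, 65, 58, 51, 44, 37, 30]
outcomes = [True, True, False, True, True, False, False, False]
payoffs = dict(gain_success=10.0, loss_failure=8.0, cost_per_test=0.1)

for cutoff in (30, 44, 51, 55, 70):
    print(cutoff, round(average_utility(scores, outcomes, cutoff, **payoffs), 2))
print("best cutoff:", best_cutoff(scores, outcomes, (30, 44, 51, 55, 70), **payoffs))
```

This toy version ignores the utility lost through false negatives; a fuller analysis would assign a value to that cell as well.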


The application of decision theory is much 
more complicated than illustrated here, mainly be- 
cause of the difficulty of finding a common utility 
scale for different outcomes. Consider the plight of 
the admissions officer at any large university. If the 
selection ratio is quite strict, then most of the admitted
students will also succeed. But some students
not admitted might have succeeded, too, and
their financial support to the university (tuition, 
fees) is therefore lost. However, if the selection 
ratio is too lenient, then the percentage of false pos- 
itives (students admitted who subsequently fail) 
skyrockets. How is the cost of a false positive to be 
calculated? The financial cost can be estimated— 
for example, advisers dedicate a certain number of 
hours at a known pay rate counseling these students.
But no single utility scale can encompass the
other diverse consequences such as the need for ad- 
ditional remedial services (which require money), 
the increase in faculty cynicism (an issue of 
morale), and the dashed hopes of misled students 
(whose heartbreak affects public perception of the 
university and may even influence future state 
funding!). Clearly, the neat statistical notions of de- 
cision theory oversimplify the complex influences 
that determine utility in the real world. 

Nonetheless, in large institutional settings
where a common utility scale can be identified, 
principles of decision theory can be applied to 
selection problems with thought-provoking re- 
sults. For example, Schmidt, Hunter, McKenzie, 
and Muldrow (1979) analyzed the potential impact 
of using the Programmer Aptitude Test (PAT;
Hughes & McNamara, 1959) in the selection of 
computer programmers by the federal government. 
They based their analysis on the following facts 
and assumptions: 


1. PAT scores and measures of later on-the-job 
programming performance correlate quite sub- 
stantially; the validity coefficient of the PAT is 
.76 (fact). 

2. The government hires 600 new programmers 
each year (fact). 

3. The cost of testing is about $10 per examinee 
(fact). 

4. Programmers stay on the job for about nine 
years and receive pay raises according to a 
known pay scale (fact). 

5. The yearly productivity in dollars of low-performing,
average, and superior programmers
can be accurately estimated by supervisors
(assumption).


Based on these facts and assumptions, Schmidt 
et al. (1979) then compared the hypothetical use of 
the PAT against other selection procedures of 
lesser validity. Since the usefulness of a test is 
partly determined by the percentage of applicants 
who are selected for employment, the researchers 
also looked at the impact of different selection 
ratios on overall productivity. In each case, they 
estimated the yearly increase in dollar-amount pro- 
ductivity from using the PAT instead of an alterna- 
tive and less efficacious procedure. In general, the 
use of the PAT was estimated to increase produc- 
tivity by tens of millions of dollars. The specific es- 
timated increase depended on the selection ratio 
and the validity coefficient of hypothetical alternative
procedures (Table 4.1). For example, if 80 percent
of the applicants were hired (selection ratio of
.80), using the PAT would increase the productivity
of the federal government by at least $5.6 million
(if the alternative procedure had a validity
coefficient of .50) and possibly as much as $16.5
million (if the alternative procedure had no validity
at all). If the selection ratio were quite small, the
use of the PAT for selection boosted productivity
even more, possibly by nearly $100 million.
Schmidt et al. (1979) concluded that “the impact
of valid selection procedures on work-force
productivity is considerably greater than most personnel
psychologists have believed.”


TABLE 4.1 Estimated Productivity Increase (in
Millions of Dollars) from One Year’s Use of the
Programmer Aptitude Test

                 True Validity of Previous Selection Procedure
Selection Ratio     .00     .20     .30     .40     .50
.05                97.2    71.7    58.9    46.1    33.3
.10                82.8    60.1    50.1    39.2    28.3
.20                66.0    48.6    40.0    31.3    22.6
.30                54.7    40.3    33.1    25.9    18.7
.40                45.6    34.6    27.6    21.6    15.6
.50                37.6    27.7    22.8    17.8    12.9
.60                30.4    22.4    18.4    14.4    10.4
.70                23.4    17.2    14.1    11.1     8.0
.80                16.5    12.2    10.0     7.8     5.6

Source: Reprinted with permission from Schmidt, F. L., Hunter,
J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of
valid selection procedures on work-force productivity. Journal
of Applied Psychology, 64, 609-626.

In general, large organizations that make 
dozens or hundreds of similar hiring decisions each 
year can profit from the application of decision the- 
ory to selection procedures, whereas smaller em- 
ployers may not possess a sufficient database to 
apply the complex methodologies inherent to this 
approach. When it is possible to use the maximiza- 
tion principle in conjunction with a measurable 
scale of utility, decision theory holds great promise. 
For example, Schmidt, Hunter, Outerbridge, and 
Trattner (1986) estimated that using valid measures 
of cognitive ability instead of nontest procedures 
(e.g., requiring a specific degree) for employee se- 
lection could increase the productivity of federal 
government workers by $8 billion over the course 
of a typical 13-year occupational tenure. 


Taylor-Russell Tables 


In the context of decision theory, we should de- 
scribe the statistical tables published by Taylor 
and Russell (1939) that permit a test user to deter- 
mine the expected proportion of successful ap- 
plicants selected with use of a test. For obvious 
reasons, these guides are known as Taylor-Russell 
tables. In order to use the Taylor-Russell tables, 
the tester must specify (1) the predictive validity of 
the test, (2) the selection ratio, and (3) the base 
rate for successful applicants. A change in any of 
these factors will alter the selection accuracy of 
the test. 

The predictive validity of a test is usually 
known from previous studies and consists of $r_{xy}$,
the correlation between test and criterion. The selection
ratio is the proportion of applicants who are
selected (usually because they are predicted to succeed).
The base rate is the proportion of successful
applicants who would be selected using current
methods, without benefit of the new test. (In the 
extreme case, the base rate is the proportion of suc- 
cessful applicants who would be chosen at random 
without benefit of any selection procedure.) When 
all three of these factors are known, the Taylor- 
Russell tables can be consulted to determine the 
proportion of successes expected through the ap- 
plication of the test. In this manner, the test user 
can determine the extent to which using a new test
would improve selection over the base rate ob- 
tained from existing methods. 

Perhaps a specific example will clarify the po- 
tential application of the Taylor-Russell tables 
(Table 4.2). Assume that the base rate for success- 
ful applicants is .60, meaning that 60 percent of ap- 
plicants accepted by current methods turn out to be 
successful. Assume also that the selection ratio is 
.50, meaning that 50 percent of all applicants are
selected. Further, assume that a new test has a validity
coefficient of .40, which specifies the correlation
between test scores and criterion. Under these
assumptions, the Taylor-Russell tables provide the
expected proportion of successes through the use of
the new test. The expected proportion of successes
turns out to be .73, a substantial improvement over
the existing base rate of .60 for successful selection.


TABLE 4.2 The Expected Proportion of Successes for a Selection Test of Given
Validity, with Given Selection Ratio, for Base Rate .60

[Table body illegible in this copy. Rows give the validity coefficient, from .00 to 1.00 in steps of .05; columns give the selection ratio.]

Source: Reprinted from Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients
to the practical effectiveness of tests in selection. Journal of Applied Psychology, 23, 565-578. In the
public domain.

The most intriguing conclusion to emerge from 
the Taylor-Russell tables is that tests with “poor” 
validity can, nonetheless, substantially improve se- 
lection accuracy—if the selection ratio is low 
enough. Consider a test with validity of merely .20, 
which doesn’t sound very impressive; for most 
tests, validity coefficients are commonly much 
higher. Assume also that the base rate for success- 
ful selection is .60, which is probably a realistic 
base rate for many forms of personnel selection. 
Further, assume that the selection ratio is very 
stringent, say .05, meaning that only 5 percent of 
the applicants are deemed acceptable and therefore 
selected. Under these assumptions, the proportion 
of successes expected through use of the test is .75, 
a net improvement of 15 percent above the base 
rate of .60. 
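
The two table entries cited above can be approximated numerically. The sketch below assumes that test and criterion scores follow a bivariate normal distribution with correlation equal to the validity coefficient, which is the usual assumption behind tables of this kind; it is an illustration, not a reproduction of the procedure Taylor and Russell used. The function name and the midpoint-rule integration are choices made here for simplicity.

```python
import math
from statistics import NormalDist


def expected_success_rate(validity, selection_ratio, base_rate, steps=4000):
    """Approximate proportion of selected applicants who succeed.

    Assumes a bivariate normal test-criterion distribution and validity < 1.
    """
    std = NormalDist()
    x_cut = std.inv_cdf(1.0 - selection_ratio)   # test cutoff in z units
    y_cut = std.inv_cdf(1.0 - base_rate)         # criterion cutoff in z units
    resid_sd = math.sqrt(1.0 - validity ** 2)
    upper = 8.0                                  # beyond this the density is negligible
    width = (upper - x_cut) / steps
    joint = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width            # midpoint of each slice
        # Probability of exceeding the criterion cutoff given test score x
        p_success = 1.0 - std.cdf((y_cut - validity * x) / resid_sd)
        joint += std.pdf(x) * p_success * width
    return joint / selection_ratio


# The two cases discussed in the text, both with a base rate of .60
print(round(expected_success_rate(0.40, 0.50, 0.60), 2))
print(round(expected_success_rate(0.20, 0.05, 0.60), 2))
```

With a validity of zero the function returns the base rate itself, which matches the idea that a test with no validity cannot improve on existing methods.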


CONSTRUCT VALIDITY


The final type of validity discussed in this unit 
is construct validity, and it is undoubtedly the 
most difficult and elusive of the bunch. A con- 
struct is a theoretical, intangible quality or trait 
in which individuals differ (Messick, 1995). Ex- 
amples of constructs include leadership ability, 
overcontrolled hostility, depression, and intelli- 
gence. Notice in each of these examples that con- 
structs are inferred from behavior, but are more 
than the behavior itself. In general, constructs are 
theorized to have some form of independent exis- 
tence and to exert broad but to some extent pre- 
dictable influences on human behavior. A test 
designed to measure a construct must estimate the 
existence of an inferred, underlying characteristic 
(e.g., leadership ability) based on a limited sample 
of behavior. Construct validity refers to the appro- 


priateness of these inferences about the underlying 
construct. 

All psychological constructs possess two char- 
acteristics in common: 


1. There is no single external referent sufficient to 
validate the existence of the construct; that is, 
the construct cannot be operationally defined 
(Cronbach & Meehl, 1955). 

2. Nonetheless, a network of interlocking supposi- 
tions can be derived from existing theory about 
the construct (AERA, APA, & NCME, 1985). 


We will illustrate these points by reference to 
the construct of psychopathy (Cleckley, 1976), a 
personality constellation characterized by antisocial
behavior (lying, stealing, and occasionally violence),
a lack of guilt and shame, and impulsivity.²
Psychopathy is surely a construct, in that there is no 
single behavioral characteristic or outcome suffi- 
cient to determine who is strongly psychopathic and 
who is not. On average we might expect psycho- 
paths to be frequently incarcerated, but so are many 
common criminals. Furthermore, many successful 
psychopaths somehow avoid apprehension alto- 
gether (Cleckley, 1976). Psychopathy cannot be 
gauged only by scrapes with the law. 

Nonetheless, a network of interlocking suppo- 
sitions can be derived from existing theory about 
psychopathy. The fundamental problem in psy- 
chopathy is presumed to be a deficiency in the abil- 
ity to feel emotional arousal—whether empathy, 
guilt, fear of punishment, or anxiety under stress 
(Cleckley, 1976). A number of predictions follow 
from this appraisal. For example, psychopaths 
should lie convincingly, have a greater tolerance 
for physical pain, show less autonomic arousal in 
the resting state, and get into trouble because of 
their lack of behavioral inhibition. Thus, to vali- 
date a measure of psychopathy, we would need to 
check out a number of different expectations based 
on our theory of psychopathy. 


2. The construct of psychopathy is very similar to what is now 
designated as antisocial personality disorder (American Psychi- 
atric Association, 1994). 


Construct validity pertains to psychological 
tests that claim to measure complex, multifaceted, 
and theory-bound psychological attributes such as 
psychopathy, intelligence, leadership ability, and 
the like. The crucial point to understand about con- 
struct validity is that “no criterion or universe of 
content is accepted as entirely adequate to define 
the quality to be measured” (Cronbach & Meehl, 
1955). Thus, the demonstration of construct valid- 
ity always rests upon a program of research using 
diverse procedures outlined in the following sec- 
tions. To evaluate the construct validity of a test, 
we must amass a variety of evidence from numer- 
ous sources. 

Although the construct validation of a test is a 
lengthy and complex process, the diverse proce- 
dures are designed to answer one crucial question: 
Based on the current theoretical understanding of 
the construct that the test claims to measure, do 
we find the kinds of relationships with nontest 
criteria that the theory predicts? Consider the 
concept of psychopathy, discussed previously, as 
measured by the Psychopathic deviate (Pd) scale 
of the MMPI and MMPI-2. One small piece of 
evidence supporting the construct validity of this 
scale is the finding that hunters who had “care- 
lessly” shot someone were significantly elevated 
on Pd when compared with other hunters (Cron- 
bach & Meehl, 1955). Such a finding fits well with 
the theoretical notion of psychopathy, especially 
as regards a lack of behavioral inhibition. Of 
course, many other lines of evidence would be 
needed to confirm the construct validity of the Pd 
scale. We see, then, that the investigation of con- 
struct validity is not essentially different from the 
general scientific procedures for confirming a 
theory. 

Many psychometric theorists regard construct 
validity as the unifying concept for all types of 
validity evidence (Cronbach, 1988; Guion, 1980; 
Messick, 1995). According to this viewpoint, indi- 
vidual studies of content, concurrent, and predic- 
tive validity are regarded merely as supportive 
evidence in the cumulative quest for construct
validation.


APPROACHES TO CONSTRUCT VALIDITY


How does a test developer determine whether a
new instrument possesses construct validity? As
previously hinted, no single procedure will suffice
for this difficult task. Evidence of construct validity
can be found in practically any empirical study
that examines test scores from appropriate groups
of subjects. Most studies of construct validity fall
into one of the following categories:


• Analysis to determine whether the test items or
  subtests are homogeneous and therefore measure
  a single construct
• Study of developmental changes to determine
  whether they are consistent with the theory of
  the construct
• Research to ascertain whether group differences
  on test scores are theory-consistent
• Analysis to determine whether intervention effects
  on test scores are theory-consistent
• Correlation of the test with other related and
  unrelated tests and measures
• Factor analysis of test scores in relation to other
  sources of information


We examine these sources of construct validity ev- 
idence in more detail in the following. 


Test Homogeneity 


If a test measures a single construct, then its com- 
ponent items (or subtests) likely will be homoge- 
neous (also referred to as internally consistent). In 
most cases, homogeneity is built into the test dur- 
ing the development process discussed in more de- 
tail in the next unit. The aim of test development is 
to select items that form a homogeneous scale. 
The most commonly used method for achieving 
this goal is to correlate each potential item with the 
total score and select items that show high correla- 
tions with the total score. A related procedure is to 
correlate subtests with the total score in the early 
phases of test development. In this manner, way- 
ward scales that do not correlate to some minimum 
degree with the total test score can be revised before
the instrument is released for general use.

Homogeneity is an important first step in certifying
the construct validity of a new test, but standing
alone it is weak evidence. Kline (1986) has
pointed out the circularity of the procedure:


If all our items in the item pool were wide of the 
mark and did not measure what we hoped, they 
would be selecting items by the criterion of their 
correlation with the total score, which can never 
work. It is to be noted that the same argument ap- 
plies to the factoring of the item pool. A general 
factor of poor items is still possible. This objection 
is sound and has to be refuted empirically. Having 
found by item analysis a set of homogeneous 
items, we must still present evidence concerning 
their validity. Thus to construct a homogeneous test 
is not sufficient, validity studies must be carried 
out. 


In addition to demonstrating the homogeneity of 
items, a test developer must provide multiple other 
sources of construct validity, discussed subse- 
quently. 
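
The item-selection procedure described in this section, correlating each item with the total score, can be sketched as follows. The response matrix is invented, the function names are hypothetical, and the minimum correlation of .30 is an arbitrary threshold chosen only for illustration. Note that each item is included in its own total score here, which inflates the correlations somewhat.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item with the total score across examinees."""
    totals = [sum(person) for person in responses]
    n_items = len(responses[0])
    return [pearson([person[i] for person in responses], totals)
            for i in range(n_items)]


# Invented data: six examinees (rows) by four items (columns), scored 0 or 1
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
correlations = item_total_correlations(responses)
# Items below the (arbitrary) threshold are candidates for revision
wayward = [i for i, r in enumerate(correlations) if r < 0.30]
print([round(r, 2) for r in correlations], wayward)
```

As Kline's objection makes clear, a high item-total correlation shows only that the items hang together, not that they measure the intended construct.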


Appropriate Developmental Changes 


Many constructs can be assumed to show regular 
age-graded changes from early childhood into ma- 
ture adulthood and perhaps beyond. Consider the 
construct of vocabulary knowledge as an example. 
It has been known since the inception of intelligence
tests at the turn of the twentieth century that knowledge
of vocabulary increases exponentially from
early childhood into late childhood. More recent 
research demonstrates that vocabulary continues to 
grow, albeit at a slower pace, into old age (Gregory 
& Gernert, 1990). For any new test of vocabulary, 
then, an important piece of construct validity evi- 
dence would be that older subjects score better 
than younger subjects, assuming that education 
and health factors are held constant. 

Of course, not all constructs lend themselves to 
predictions about developmental changes. For ex- 
ample, it is not clear whether a scale measuring 
“assertiveness” should show a pattern of increasing,
decreasing, or stable scores with advancing
age. Developmental changes would be irrelevant to
the construct validity of such a scale. We should
also mention that appropriate developmental
changes are but one piece in the construct validity
puzzle. This approach does not provide information
about how the construct relates to other
constructs.


Theory-Consistent Group Differences


One way to bolster the validity of a new instrument 
is to show that, on average, persons with different 
backgrounds and characteristics obtain theory- 
consistent scores on the test. Specifically, persons 
thought to be high on the construct measured by 
the test should obtain high scores, whereas persons 
with presumably low amounts of the construct 
should obtain low scores. 

Crandall (1981) developed a social interest 
scale that illustrates the use of theory-consistent 
group differences in the process of construct vali- 
dation. Borrowing from Alfred Adler, Crandall 
(1984) defined social interest as an “interest in and 
concern for others.” To measure this construct, he 
devised a brief and simple instrument consisting of 
15 forced-choice items. For each item, one of the 
two alternatives includes a trait closely related to 
the Adlerian concept of social interest (e.g., help- 
ful), whereas the other choice consists of an 
equally attractive but nonsocial trait (e.g., quick- 
witted). The subject is instructed to “choose the 
trait which you value more highly.” Each of the 15 
items is scored 1 if the social interest trait is 
picked, 0 otherwise; thus, total scores on the Social 
Interest Scale (SIS) can range from 0 to 15. 
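
The scoring rule just described is simple enough to sketch in a few lines. The following is an illustration only: the function name `score_sis` and the boolean encoding of responses (True when the social interest trait is chosen) are assumptions made for this example and are not part of Crandall's materials.

```python
def score_sis(choices):
    """Return a Social Interest Scale total from 15 forced-choice responses.

    `choices` is a sequence of 15 booleans (a hypothetical encoding):
    True if the examinee picked the social interest trait on that item,
    False if the nonsocial trait was picked.
    """
    if len(choices) != 15:
        raise ValueError("the SIS has exactly 15 items")
    # Each social interest choice scores 1; every other choice scores 0.
    return sum(1 for picked_social in choices if picked_social)


# Hypothetical examinee who picks the social trait on 9 of the 15 items.
responses = [True] * 9 + [False] * 6
total = score_sis(responses)
```

Because every item contributes either 0 or 1, the total is necessarily bounded by 0 and 15, as stated in the text.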

Table 4.3 presents average scores on the SIS 
for 13 well-defined groups of subjects. The reader 
will notice that individuals likely to be high in so- 
cial interest (e.g., nuns) obtain the highest average 
scores on the SIS, whereas the lowest scores are 
earned by presumably self-centered persons (e.g., 
models) and those who are outright antisocial 
(felons). These findings are theory-consistent and 
support the construct validity of this interesting 
instrument. 


TABLE 4.3 Mean Scores on the Social Interest Scale for Selected Groups

Group                                                          N   Mean Score
Ursuline sisters                                               6      13.3
Adult church members                                         147      11.2
Charity volunteers                                             9      10.8
High school students nominated for high social interest       23      10.2
University students nominated for high social interest        21       9.5
University employees                                         327       8.9
University students                                        1,784       8.2
University students nominated for low social interest         35       7.4
Professional models                                           54       7.1
High school students nominated for low social interest        22       6.9
Adult atheists and agnostics                                  30       6.7
Convicted felons                                              30       6.4

Source: Adapted with permission from Crandall, J. (1981). Theory and measurement of social interest:
Empirical tests of Alfred Adler’s concept. New York: Columbia University Press.


Theory-Consistent Intervention Effects 


Another approach to construct validation is to show 
that test scores change in appropriate direction and 
amount in reaction to planned or unplanned inter- 
ventions. For example, the scores of elderly persons 
on a spatial orientation test battery should increase 
after these subjects receive cognitive training specif- 
ically designed to enhance their spatial orientation 
abilities. More precisely, if the test battery possesses 
construct validity, we can predict that spatial orien- 
tation scores should show a greater increase from 
pretest to posttest than found on unrelated abilities 
not targeted for special training (e.g., inductive rea- 
soning, perceptual speed, numerical reasoning, or 
verbal reasoning). Willis and Schaie (1986) found 
just such a pattern of test results in a cognitive train- 
ing study with elderly subjects, supporting the con- 
struct validity of their spatial orientation measure. 


Convergent and Discriminant Validation 


Convergent validity is demonstrated when a test 
correlates highly with other variables or tests with 
which it shares an overlap of constructs. For exam- 
ple, two tests designed to measure different types 


of intelligence should, nonetheless, share enough 
of the general factor in intelligence to produce a 
hefty correlation (say, .5 or above) when jointly ad- 
ministered to a heterogeneous sample of subjects. 
In fact, any new test of intelligence that did not cor- 
relate at least modestly with existing measures 
would be highly suspect, on the grounds that it did 
not possess convergent validity. 

Discriminant validity is demonstrated when a 
test does not correlate with variables or tests from 
which it should differ. For example, social interest 
and intelligence are theoretically unrelated, and 
tests of these two constructs should correlate negli- 
gibly, if at all. 
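
Both kinds of evidence come down to inspecting correlation coefficients. The sketch below computes a Pearson correlation by hand and applies it to invented scores; the variable names and all of the data are hypothetical and serve only to show the expected pattern (a high correlation between two intelligence measures, a negligible one between intelligence and social interest).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviation scores.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for six examinees (illustration only, not real data).
new_iq_test = [95, 102, 110, 118, 125, 131]
established_iq_test = [92, 105, 108, 121, 120, 135]
social_interest = [9, 7, 10, 8, 9, 8]

r_convergent = pearson_r(new_iq_test, established_iq_test)
r_discriminant = pearson_r(new_iq_test, social_interest)
```

With these made-up numbers the first coefficient is well above the .5 benchmark mentioned in the text, whereas the second is close to zero.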

In a classic paper often quoted but seldom emulated,
Campbell and Fiske (1959) proposed a systematic
experimental design for simultaneously
confirming the convergent and discriminant validities
of a psychological test. Their design is called
the multitrait-multimethod matrix, and it calls for
the assessment of two or more traits by two or more
methods. Table 4.4 provides a hypothetical example
of this approach. In this example, three traits (A,
B, and C) are measured by three methods (1, 2, and
3). For example, traits A, B, and C might be social
interest, creativity, and dominance. Methods 1, 2,
and 3 might be a self-report inventory, peer ratings,
and a projective test. Thus, A1 would represent a
self-report inventory of social interest, B2 a peer rating
of creativity, C3 a dominance measure derived from
a projective test, and so on.


TABLE 4.4 Hypothetical Multitrait-Multimethod Matrix

[Table body not recoverable from the source scan: a lower-triangular matrix of
correlations among the nine tests A1 through C3, with reliability coefficients
in parentheses on the main diagonal.]

Note: Letters A, B, and C refer to traits; subscripts 1, 2, and 3 refer to methods. The matrix consists of
correlation coefficients (decimals omitted). See text.

Source: Reprinted with permission from Campbell, D. T., & Fiske, D. W. (1959). Convergent and
discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.


Notice in this example that nine tests are stud- 
ied (three traits are each measured by three meth- 
ods). When each of these tests is administered twice 
to the same group of subjects and scores on all pairs 
of tests are correlated, the result is a multitrait- 
multimethod matrix (Table 4.4). This matrix is a 
rich source of data on reliability, convergent valid- 
ity, and discriminant validity. 

For example, the correlations along the main 
diagonal (in parentheses) are reliability coeffi- 
cients for each test. The higher these values, the 
better, and preferably we like to see values in the 
.80s or .90s here. The correlations along the three 
shorter diagonals (in boldface) supply evidence of 
convergent validity—the same trait measured by 
different methods. These correlations should be 
strong and positive, as shown here. Notice that the 
table also includes correlations between different 
traits measured by the same method (in solid tri- 
angles) and different traits measured by different 
methods (in dotted triangles). These correlations 


should be the lowest of all in the matrix, insofar as 
they supply evidence of discriminant validity. 
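
The four kinds of entries in a multitrait-multimethod matrix can be distinguished mechanically once each test is labeled by its trait and its method. The sketch below is illustrative only: the function `classify_cell` and the `(trait, method)` tuple encoding are assumptions introduced here, not part of Campbell and Fiske's procedure.

```python
def classify_cell(test_a, test_b):
    """Say what kind of evidence one cell of the matrix supplies.

    Each test is a (trait, method) pair such as ("A", 1).
    """
    trait_a, method_a = test_a
    trait_b, method_b = test_b
    if trait_a == trait_b and method_a == method_b:
        return "reliability"  # a test correlated with itself (main diagonal)
    if trait_a == trait_b:
        return "convergent validity"  # same trait, different methods
    if method_a == method_b:
        return "discriminant validity (same method)"  # different traits
    return "discriminant validity (different methods)"  # different traits


# Nine tests: traits A, B, C each measured by methods 1, 2, 3.
tests = [(trait, method) for method in (1, 2, 3) for trait in "ABC"]

# Count the cell types in the lower triangle, diagonal included.
counts = {}
for i, row_test in enumerate(tests):
    for col_test in tests[: i + 1]:
        kind = classify_cell(row_test, col_test)
        counts[kind] = counts.get(kind, 0) + 1
```

For three traits and three methods this yields 9 reliability cells, 9 convergent validity cells, and 27 discriminant validity cells (9 within a method and 18 across methods), which accounts for all 45 entries in the lower triangle.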

The Campbell and Fiske (1959) methodology is 
an important contribution to our understanding of 
the test validation process. However, the full im- 
plementation of this procedure typically requires 
too monumental a commitment from researchers. 
It is more common for test developers to collect 
convergent and discriminant validity data in bits 
and pieces, rather than producing an entire matrix 
of intercorrelations. Meier (1984) provides one of 
the few real-world implementations of the multi- 
trait-multimethod matrix in an examination of the 
validity of the “burnout” construct. 


Factor Analysis 


Factor analysis is a specialized statistical tech- 
nique that is particularly useful for investigating 
construct validity. We discuss factor analysis in 
substantial detail in Topic 8A, Aptitude Tests and 
Factor Analysis; here, we provide a quick preview 
so that the reader can appreciate the role of factor 
analysis in the study of construct validity. The pur- 
pose of factor analysis is to identify the minimum 
number of determiners (factors) required to ac- 


count for the intercorrelations among a battery of 
tests. The goal in factor analysis is to find a smaller 
set of dimensions, called factors, that can account 
for the observed array of intercorrelations among 
individual tests. A typical approach in factor analy- 
sis is to administer a battery of tests to several hun- 
dred subjects and then calculate a correlation 
matrix from the scores on all possible pairs of tests. 
For example, if 15 tests have been administered to 
a sample of psychiatric and neurological patients, 
the first step in factor analysis is to compute the 
correlations between scores on the 105 possible 
pairs of tests. Although it may be feasible to see 
certain clusterings of tests that measure common 
traits, it is more typical that the mass of data found 
in a correlation matrix is simply too complex for 
the unaided human eye to analyze effectively. For- 
tunately, the computer-implemented procedures of 
factor analysis search this pattern of intercorrela- 
tions, identify a small number of factors, and then 
produce a table of factor loadings. A factor load- 
ing is actually a correlation between an individual 
test and a single factor. Thus, factor loadings can 
vary between -1.0 and +1.0. The final outcome of
a factor analysis is a table depicting the correlation 
of each test with each factor. 

A table of factor loadings helps describe the 
factorial composition of a test and thereby provides 
information relevant to construct validity. We will 
illustrate this point with factor analytic data from a 
study of the Category Test. The Category Test is a 
relatively complex concept formation test designed 
to be different from traditional psychometric mea- 
sures of intelligence and superior to them at detect- 
ing neurological disorders (Reitan & Wolfson, 
1985). If the Category Test does, indeed, measure 
something different from traditional tests of intelli- 
gence, then it should load strongly on one or more 
factors not represented by the subtests of the WAIS. 
Such a finding would strengthen the construct va- 
lidity of the Category Test by distinguishing it from 
traditional measures of intelligence. 


3. The general formula for the number of pairings among N
tests is N(N - 1)/2. Thus, if 15 tests are administered, there will
be 15 × 14/2 or 105 possible pairings of individual tests.
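
The pairing formula is easy to check by brute force. In the sketch below, the helper name `n_pairs` and the placeholder test labels are invented for illustration; `itertools.combinations` and `math.comb` are standard-library functions.

```python
from itertools import combinations
from math import comb


def n_pairs(n_tests):
    """Number of distinct pairs among n_tests tests: N(N - 1)/2."""
    return n_tests * (n_tests - 1) // 2


# Enumerate every pair among 15 placeholder test names and count them.
test_names = [f"test_{i}" for i in range(1, 16)]
all_pairs = list(combinations(test_names, 2))
```

Enumerating the pairs, applying the formula, and calling `math.comb(15, 2)` all give the same count of 105 correlations for a 15-test battery.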


TABLE 4.5 Factor Loadings for the Category Test, Finger Tapping Test, and WAIS Subtests

                                  Factor Loading
Test                         I        II       III       IV
Information                 .88      .15      .07      .07
Comprehension               .83     -.03      .06     -.09
Arithmetic                  .43      .26      .67     -.12
Similarities                .78      .30      .17      .02
Digit Span                  .23      .08      .83      .12
Vocabulary                  .92      .07      .06      .01
Digit Symbol                .23      .31   [illeg.]    .61
Picture Completion          .64      .50     -.24     -.01
Block Design                .39      .74      .06      .20
Picture Arrangement         .50      .60      .12     -.01
Object Assembly             .29      .73      .00      .31
Category Test               .19      .82   [illeg.]   -.18
Finger Tapping Test         .07     -.08      .18      .76

Source: Adapted with permission from Lansdell, H., & Donnelly,
E. F. (1977). Factor analysis of the Wechsler Adult Intelligence
Scale Subtests and the Halstead-Reitan Category and Tapping
Tests. Journal of Consulting and Clinical Psychology, 45,
412-416.


Lansdell and Donnelly (1977) administered the 
11 subtests of the Wechsler Adult Intelligence Scale, 
the Category Test, and the Finger Tapping Test to 
94 psychiatric and neurological patients. The test 
scores were factor analyzed, producing the factor 
loadings shown in Table 4.5. Notice that the verbal 
subtests from the WAIS have the highest loadings on 
factor I, which is surely a factor of verbal compre- 
hension. The Category Test has a minimal loading 
on this factor, indicating that verbal abilities are not 
particularly important for good performance on this 
test. Factor II has its strongest loadings on Block 
Design (.74) and Object Assembly (.73) and is typi- 
cally labeled a perceptual organization factor.* 
Unfortunately, the Category Test has a substantial
loading (.82) on this factor and this factor alone. At
least for this sample of subjects, it appears that the
Category Test is merely an alternative measure of
perceptual organizational skills and not a new and
different test, as many of its users might like to
claim. Incidentally, factor III seems to measure
freedom from distractibility, and factor IV appears
to be a pure measure of motor speed.


4. Notice that humans provide the label for a factor based on an
analysis of the tests that load most strongly on it. Two investigators
might conceivably use different names for the same factor, for
example, referring to factor II as either perceptual organization
or visuospatial analysis.

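
The kind of reading applied to Table 4.5 (find the factor on which each test loads most strongly) can be sketched as follows. The three rows are taken from Table 4.5; the helper names and the .40 cutoff for a "salient" loading are assumptions made for this illustration and do not come from Lansdell and Donnelly.

```python
FACTOR_NAMES = ("I", "II", "III", "IV")

# Loadings on factors I-IV for three tests from Table 4.5.
loadings = {
    "Comprehension": (0.83, -0.03, 0.06, -0.09),
    "Block Design": (0.39, 0.74, 0.06, 0.20),
    "Finger Tapping Test": (0.07, -0.08, 0.18, 0.76),
}


def dominant_factor(row):
    """Return (factor name, loading) for the largest absolute loading."""
    index = max(range(len(row)), key=lambda i: abs(row[i]))
    return FACTOR_NAMES[index], row[index]


def salient_factors(row, cutoff=0.40):
    """List factors whose absolute loading meets a hypothetical cutoff."""
    return [FACTOR_NAMES[i] for i, value in enumerate(row) if abs(value) >= cutoff]


summary = {test: dominant_factor(row) for test, row in loadings.items()}
```

Absolute values are used because a loading is a correlation, and a strong negative loading is as informative as a strong positive one.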
EXTRAVALIDITY CONCERNS AND THE 
WIDENING SCOPE OF TEST VALIDITY 


We begin this section with a review of extravalidity 
concerns, which include side effects and unin- 
tended consequences of testing. By acknowledging 
the importance of the extravalidity domain, psy- 
chologists confirm that the decision to use a test in- 
volves social, legal, and political considerations that 
extend far beyond the traditional questions of tech- 
nical validity. In a related development, we will also 
review how the interest in extravalidity concerns has 
spurred several theorists to broaden the concept of 
test validity. As the reader will discover, value im- 
plications and social consequences are now encom- 
passed within the widening scope of test validity. 

Even if a test is valid, unbiased, and fair, the 
decision to use it may be governed by additional 
considerations. Cole and Moss (1989) outline the 
following factors: 





• What is the purpose for which the test is used?

• To what extent are the purposes accomplished
by the actions taken?

• What are the possible side effects or unintended
consequences of using the test?

• What possible alternatives to the test might serve
the same purpose?


We survey only the most prominent extravalidity 
concerns here and show how they have served to 
widen the scope of test validity. Readers who wish 
more detail on these topics should consult AERA, 
APA, NCME (1999), Cole and Moss (1989), Cronbach
(1988), and Jensen (1980, chap. 15).


Unintended Side Effects of Testing 


The intended outcome of using a psychological 
test is not necessarily the only consequence. Various
side effects also are possible; indeed, they are
probable. The examiner must determine whether
the benefits of giving the test outweigh the costs of 
the potential side effects. Furthermore, by antici- 
pating unintended side effects, the examiner might 
be able to deflect or diminish them. 

Cole and Moss (1989) cite the example of
using psychological tests to determine eligibility 
for special education. Although the intended out- 
come is to help students learn, the process of iden- 
tifying students eligible for special education may 
produce numerous negative side effects: 


• The identified children may feel unusual or dumb.

• Other children may call the children names.

• Teachers may view these children as unworthy
of attention.

• The process may produce classes segregated by
race or social class.


A consideration of side effects should influence an 
examiner’s decision to use a particular test for a 
specified purpose. The examiner might appropri- 
ately choose not to use a test for a worthy purpose 
if the likely costs from side effects outweigh the 
expected benefits.

Consider the common practice in years past of 
using the Minnesota Multiphasic Personality Inventory
(MMPI) to help screen candidates for
peace officer positions such as police officer or 
sheriff’s deputy. Although the MMPI was origi- 
nally designed as an aid in psychiatric diagnosis, 
subsequent research indicated that it is also use- 
ful in the identification of persons unsuited to a ca- 
reer in law enforcement (Hargrave, 1985; Hiatt & 
Hargrave, 1988). In particular, peace officers who 
produce MMPI profiles with mild elevations 
(e.g., T score 65 to 69) on Scales F (Frequency), 
Masculinity-Femininity, Paranoia, and Hypo- 
mania tend to be involved in serious disciplinary 
actions; peace officers who produce more “defen- 
sive” MMPI profiles with fewer clinical scale ele- 
vations tend not to be involved in such actions. 
Thus, the test possessed modest validity for the 
worthy purpose of screening law enforcement 
candidates. But no test, not even the highly re- 
spected MMPI, is perfectly valid. Some good ap- 


plicants will be passed over because their MMPI 
results are marginal. Perhaps their Paranoia Scale 
is at T score of 66, or the Hypomania Scale is 
at T score of 68. On the MMPI, a T score of 70 is 
often considered the upper limit of the “normal” 
range. 

One unintended side effect of using the MMPI
for evaluation of peace officer applicants is that 
job candidates who are unsuccessful with one 
agency may be tagged with a pathological label 
such as psychopathic, schizophrenic, or paranoid. 
The label may arise in spite of the best efforts of 
the consulting psychologist, who may never have 
used any pejorative terms in the assessment report 
on the candidate. Typically, the label is conceived 
when administrators at the referring department 
look at the MMPI profile and see that the candidate
obtained his/her highest score on a scale
with a horrendous title such as Psychopathic
Deviate, Schizophrenia, Hypochondriasis, or
Paranoia. Unfortunately, the law enforcement
community can be a very closed fraternity. Police 
chiefs and sheriffs commonly exchange verbal re- 
ports about their job applicants, so a pejorative 
label may follow the candidate from one setting to 
another, permanently barring the applicant from 
entry into the law enforcement profession. The
repercussions are not only unfair to the candidate;
they also raise the specter of lawsuits against
the agency and the consulting psychologist. All
things considered, the consulting psychologist 
may find it preferable to use a technically less 
valid test for the same purpose, particularly if the 
alternative instrument does not produce these un- 
intended side effects. 

The renewed sensitivity to extravalidity issues 
has caused several test theorists to widen their 
definition of test validity. We review these recent 
developments in the following section, cautioning 
the reader that a final consensus about the nature of 
test validity is yet to emerge. 


The Widening Scope of Test Validity 


By now the reader is familiar with the narrow, 
traditionalist perspective on test use, which states 
that a test is valid if it measures “what it purports
to measure.” The implication of this
perspective is that technical validity is the most
essential basis for recommending test use. After 
all, valid tests provide accurate information 
about examinees—and what could be wrong with 
that? 

Recently, several psychometric theoreticians 
have introduced a wider, functionalist definition 
of validity that asserts that a test is valid if it serves 
the purpose for which it is used (Cronbach, 1988; 
Messick, 1995). For example, a reading achievement
test might be used to identify students for assignment
to a remedial section. According to
the functionalist perspective, the test would be
valid, and its use therefore appropriate, if the
students selected for remediation actually received
some academic benefit from this application
of the test.

The functionalist perspective explicitly recog- 
nizes that the test validator has an obligation to 
determine whether a practice has constructive con- 
sequences for individuals and institutions, and 
especially to guard against adverse outcomes 
(Messick, 1980). Test validity, then, is an overall 
evaluative judgment of the adequacy and appropri- 
ateness of inferences and actions that flow from 
test scores. 

Messick (1980, 1995) argues that the new, 
wider conception of validity rests on four bases. 
These are (1) traditional evidence of construct 
validity, for example, appropriate convergent and 
discriminant validity, (2) an analysis of the value 
implications of the test interpretation, (3) evi- 
dence for the usefulness of test interpretations 
in particular applications, and (4) an appraisal of 
the potential and actual social consequences, including
side effects, from test use. A valid test
is one that answers well to all four facets of test 
validity. 

This wider conception of test validity is admit- 
tedly controversial, and some theorists prefer the 
traditional view that consequences and values are 
important but nonetheless separate from the tech- 
nical issues of test validity. Everyone can agree on 
one point: Psychological measurement is not a
neutral endeavor; it is an applied science that occurs
in a social and political context.


SUMMARY


1. The validity of a test is the degree to which 
it measures what it claims to measure. A test is 
valid to the extent that inferences made from it 
are appropriate, meaningful, and useful. Reliabil- 
ity is a necessary but not a sufficient precursor to 
validity. 

2. Traditionally, the different ways of accumulating
validity evidence have been grouped into
three categories: content validity, criterion-related 
validity, and construct validity. However, validity 
is a unitary concept and any empirical study may 
bear upon the validity of a test. 


3. Content validity is determined by the de- 
gree to which the questions, tasks, or items on a 
test are representative of the universe of behavior 
the test was designed to sample. Content validity is 
easy to assure for well-defined traits such as 
spelling ability, but more difficult to specify for in- 
explicit traits such as anxiety. 


4. A test has face validity if it looks valid to 
test users, examiners, and especially the exami- 
nees. Face validity is important for social accept- 
ability of a test but is irrelevant for psychometric 
purposes. 

5. Criterion-related validity is demonstrated 
when a test is effective in predicting performance 
on an appropriate outcome measure. Criterion- 
related validity subsumes concurrent validity, in 
which the criterion measures are obtained at ap- 
proximately the same time as the predictor test 
scores, and predictive validity, in which the crite- 
rion measures are obtained in the future. 


6. When tests are used for purposes of pre- 
diction, it is necessary to develop a regression 
equation. A regression equation describes the best- 
fitting straight line (one that minimizes the sum of 
the squared deviations from the line) for estimating 
the criterion from the test. For example, the equa- 
tion Y = .07X + .2 might be used to predict job rat- 
ings from an employment test. 


7. The correlation between test and criterion
(r_xy) is known as a validity coefficient. The higher
the correlation, the more accurate the test is in
estimating the criterion.


8. The standard error of estimate (SE_est) is
the margin of error to be expected in the predicted
criterion score. The error of estimate is derived
from the following formula:


$$SE_{est} = SD_y \sqrt{1 - r_{xy}^2}$$


where r_xy is the validity coefficient and SD_y is the
standard deviation of the criterion scores.


9. Proponents of decision theory stress that a 
test must aid in accurate decision making. The ac- 
curate prediction of success versus failure on an 
outcome measure is essential. Tests should avoid 
two kinds of errors: false positives, in which sub- 
jects are predicted to succeed but fail, and false 
negatives, in which subjects are predicted to fail 
but succeed. 


10. Decision theory assumes that the costs of 
accurate and inaccurate predictions can be mea- 
sured on a common utility scale such as profit/loss. 
A fundamental assumption of decision theory is 
maximization: In institutional selection decisions, 
the most appropriate test-use strategy is one that 
maximizes the average gain or minimizes the aver- 
age loss. 


11. A construct is a theoretical, intangible 
quality or trait in which individuals differ. Con- 
struct validity pertains to psychological tests that 
claim to measure complex, multifaceted, and 
theory-bound psychological attributes such as 
leadership ability, overcontrolled hostility, and 
intelligence. 


12. Studies of construct validity generally fall 
into one of these categories: analysis of item ho- 
mogeneity; assessment of developmental and 
group changes on the test; analysis of intervention 
effects; correlation and factor analysis of test 
scores in relation to other sources of information. 


In each case, the crucial question is whether the re- 
sults are consistent with the underlying theory of 
the construct that is measured. 


13. Extravalidity concerns include the side ef- 
fects and unintended consequences of testing. For 
example, a valid assessment for special education 
placement may nonetheless cause identified chil- 
dren to feel unusual or dumb. A consideration of 
side effects should influence an examiner’s decision
to use a particular test for a specific purpose.


14. The new, wider, functionalist perspective 
on test validity asserts that a test is valid if it serves 
the purposes for which it is used. For example, the 
validity of a reading achievement test might be 
linked to successful remediation of the reading- 
impaired students identified by the test. 
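
The quantitative ideas in summary points 6 through 10 can be tied together in a brief sketch. Everything below is illustrative: the function names, the sample inputs, and the utility values are hypothetical, and only the regression equation Y = .07X + .2 and the formula for the standard error of estimate come from the summary itself.

```python
from math import sqrt


def predict_criterion(x, slope=0.07, intercept=0.2):
    """Point 6: predict the criterion from a test score, Y = .07X + .2."""
    return slope * x + intercept


def standard_error_of_estimate(sd_y, r_xy):
    """Point 8: SE_est = SD_y * sqrt(1 - r_xy squared)."""
    if not -1.0 <= r_xy <= 1.0:
        raise ValueError("a validity coefficient must lie between -1 and +1")
    return sd_y * sqrt(1.0 - r_xy ** 2)


def tally_decisions(predicted_success, actual_success):
    """Point 9: count correct decisions and the two kinds of error."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for predicted, actual in zip(predicted_success, actual_success):
        if predicted and actual:
            counts["true_positive"] += 1
        elif predicted and not actual:
            counts["false_positive"] += 1   # predicted to succeed but failed
        elif not predicted and actual:
            counts["false_negative"] += 1   # predicted to fail but succeeded
        else:
            counts["true_negative"] += 1
    return counts


def average_utility(counts, utilities):
    """Point 10: mean gain or loss per decision on a common utility scale."""
    total = sum(counts.values())
    return sum(counts[kind] * utilities[kind] for kind in counts) / total


# Hypothetical inputs chosen only to keep the arithmetic easy to follow.
predicted_rating = predict_criterion(40)
se_est = standard_error_of_estimate(10.0, 0.6)
outcomes = tally_decisions([True, True, False, False, True],
                           [True, False, False, True, True])
made_up_utilities = {"true_positive": 1.0, "false_positive": -1.0,
                     "true_negative": 0.0, "false_negative": -0.5}
mean_gain = average_utility(outcomes, made_up_utilities)
```

Note how the standard error shrinks as the validity coefficient grows: with r_xy = 0 it equals SD_y, and with r_xy = ±1 it falls to zero. Under the maximization assumption, a test-use strategy would be preferred when it yields a higher average utility than its alternatives.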


Creating a new test involves both science and
art. A test developer must choose strategies 
and materials, and then make day-to-day research 
decisions that will affect the quality of his or her 
emerging instrument. The purpose of this section is 
to discuss the process by which psychometricians 
create valid tests. Although we will discuss many 
separate topics, they are united by a common 
theme: Valid tests do not just materialize upon the 
scene in full maturity—they emerge slowly from 
an evolutionary, developmental process that builds 
in validity from the very beginning. We will em- 
phasize the basics of test development here; read- 
ers who desire a more advanced presentation 
should consult Kline (1986), McDonald (1999), 
and Bernstein and Nunnally (1994). 
Test construction consists of six intertwined 
stages: 


• Defining the test
• Selecting a scaling method
• Constructing the items
• Testing the items
• Revising the test
• Publishing the test


By way of preview, we can summarize these steps as 
follows: defining the test consists of delimiting its 
scope and purpose, which must be known before the 
developer can proceed to test construction. Select- 
ing a scaling method is a process of setting the rules 
by which numbers are assigned to test results. Constructing
the items is as much art as science, and it is
here that the creativity of the test developer may be 
required. Once a preliminary version of the test is 
available, the developer usually administers it to a 
modest-sized sample of subjects in order to collect 
initial data about test item characteristics. Testing 
the items entails a variety of statistical procedures 
referred to collectively as item analysis. The pur- 
pose of item analysis is to determine which items 
should be retained, which revised, and which 
thrown out. Based on item analysis and other 
sources of information, the test is then revised. If the 
revisions are substantial, new items and additional 
pretesting with new subjects may be required. Thus, 
test construction involves a feedback loop whereby 
second, third, and fourth drafts of an instrument 
might be produced (Figure 4.5). Publishing the test 
is the final step. In addition to releasing the test ma- 
terials, the developer must produce a user-friendly 
test manual. Let us examine each of these steps in 
more detail. 


DEFINING THE TEST


In order to construct a new test, the developer must
have a clear idea of what the test is to measure and
how it is to differ from existing instruments. Insofar
as psychological testing is now entering its second
one hundred years, and insofar as thousands of
tests have already been published, the burden of
proof clearly rests upon the test developer to show
that a proposed instrument is different from, and
better than, existing measures.


[Flowchart, in order: Defining the Test; Selecting a Scaling Method;
Constructing the Items; Testing the Items; Revising the Test;
Publishing the Test]

FIGURE 4.5 The Test Construction Process

Consider the daunting task faced by a test de- 
veloper who proposes yet another measure of gen- 
eral intelligence. With dozens of such instruments 
already in existence, how could a new test possibly 
make a useful contribution to the field? The answer 
is that contemporary research continually adds to 
our understanding of intelligence and impels us to 
seek new and more useful ways to measure this 
multifaceted construct. 

Kaufman and Kaufman (1983) provide a good 
model of the test definition process. In proposing the 
Kaufman Assessment Battery for Children (K-ABC), 
a new test of general intelligence in children, the au- 
thors listed six primary goals that define the purpose 
of the test and distinguish it from existing measures: 


1. Measure intelligence from a strong theoretical 
and research basis 

2. Separate acquired factual knowledge from the 
ability to solve unfamiliar problems 

3. Yield scores that translate to educational inter- 
vention 
4. Include novel tasks

5. Be easy to administer and objective to score 

6. Be sensitive to the diverse needs of preschool, 
minority group, and exceptional children 
(Kaufman & Kaufman, 1983) 


As the reader will discover in a later topic, the 
K-ABC represents an interesting departure from 
traditional intelligence tests. For now, the impor- 
tant point is that the developers of this recent in- 
strument explained its purpose explicitly and 
proposed a fresh focus for measuring intelligence, 
long before they started constructing test items. 


SELECTING A SCALING METHOD


The immediate purpose of psychological testing 
is to assign numbers to responses on a test so 
that the examinee can be judged to have more or 
less of the characteristic measured. The rules by 
which numbers are assigned to responses define 
the scaling method. Test developers select a scaling 
method that is optimally suited to the manner in 
which they have conceptualized the trait(s) mea- 
sured by their test. No single scaling method is 
uniformly better than the others. For some traits, 
ordinal ranking of expert judges might be the best 
measurement approach; for other traits, complex 
scaling of self-report data might yield the most 
valid measurements. 

There are so many distinctive scaling methods 
available to psychometricians that we will be satis- 
fied to provide only a representative sample here. 
Readers who wish a more thorough and detailed 
review should consult Gulliksen (1950), Nunnally 
(1978), or Kline (1986). However, before reviewing
selected scaling methods, we need to introduce
a related concept, levels of measurement, so
that the reader can better appreciate the differences 
between scaling methods. 


Levels of Measurement 


According to Stevens (1946), all numbers derived 
from measurement instruments of any kind can be 
placed into one of four hierarchical categories: 
nominal, ordinal, interval, or ratio. Each category 
defines a level of measurement; the order listed is
from least to most informative. 

In a nominal scale, the numbers serve only as 
category names. For example, when collecting 
data for a demographic study, a researcher might 
code males as “1” and females as “2.” Notice that 
the numbers are arbitrary and do not designate 
“more” or “less” of anything. In nominal scales the 
numbers are just a simplified form of naming. 

An ordinal scale constitutes a form of ordering 
or ranking. If college professors were asked to rank 
order four cars as to which they would prefer to 
own, the preferred order might be “1” Cadillac, “2” 
Chevrolet, “3” Volkswagen, “4” Hyundai. Notice 
here that the numbers are not interchangeable. A 
ranking of “1” is “more” than a ranking of “2,” and 
so on. The “more” refers to the order of preference. 
However, ordinal scales fail to provide information 
about the relative strength of rankings. In this hy- 
pothetical example, we do not know whether col- 
lege professors strongly prefer Cadillacs over 
Chevrolets or just marginally so. 

An interval scale provides information about 
ranking, but also supplies a metric for gauging the 
differences between rankings. To construct an in- 
terval scale, we might ask our college professors to 
rate on a scale from 1 to 100 how much they would
like to own the four cars previously listed. Suppose 
the average ratings work out as follows: Cadillac, 
90; Chevrolet, 70; Volkswagen, 60; Hyundai, 50. 
From this information we could infer that the pref- 
erence for a Cadillac is much stronger than for a 
Chevrolet, which, in turn, is mildly stronger than 
the preference for a Volkswagen. More important, 
we can also make the assumption that the intervals 
between the points on this scale are approximately 
the same: The difference between professors’ pref- 
erence for a Chevrolet and Volkswagen (10 points) 
is about the same as that between a Volkswagen 
and a Hyundai (also 10 points). In short, interval 
scales are based on the assumption of equal-sized 
units or intervals for the underlying scale. 

A ratio scale has all the characteristics of an
interval scale, but also possesses a conceptually
meaningful zero point in which there is a total absence
of the characteristic being measured. The
essential characteristics of the four levels of measurement
are summarized in Figure 4.6.


                         Characteristics
            Allows for     Allows for   Uses Equal   Possesses Real
Level       Categorizing   Ranking      Intervals    Zero Point
Nominal     x
Ordinal     x              x
Interval    x              x            x
Ratio       x              x            x            x

FIGURE 4.6 Essential Characteristics of Four Levels of
Measurement

Ratio scales are rare in psychological measure- 
ment. Consider whether there is any meaningful 
sense in which a person can be thought to have 
zero intelligence. Not really. The same is true for 
most constructs in psychology: Meaningful zero 
points just do not exist. However, a few physical 
measures used by psychologists qualify as ratio 
scales. For example, height and weight qualify, and 
perhaps some physiological measures such as elec- 
trodermal response qualify, too. But by and large 
the best a psychologist can hope for is interval- 
level measurement. 
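
The practical consequence of lacking a true zero point can be shown with a short sketch. It reuses the hypothetical car ratings from the interval-scale example; the 20-point shift and the function names are illustrative assumptions, not part of any published procedure. Because the zero point of an interval scale is arbitrary, adding a constant to every rating leaves the differences between ratings intact but changes their ratios, which is why a statement such as "almost twice as preferred" is not meaningful at this level of measurement.

```python
# Hypothetical average ratings from the interval-scale example in the text.
ratings = {"Cadillac": 90, "Chevrolet": 70, "Volkswagen": 60, "Hyundai": 50}

# The zero of an interval scale is arbitrary, so shifting every score by a
# constant (20 is an arbitrary illustrative choice) gives an equally valid scale.
shifted = {car: score + 20 for car, score in ratings.items()}


def difference(scale, a, b):
    """Difference between two scale values (meaningful on interval scales)."""
    return scale[a] - scale[b]


def ratio(scale, a, b):
    """Ratio of two scale values (meaningful only on ratio scales)."""
    return scale[a] / scale[b]


# Differences survive the shift; ratios do not.
print(difference(ratings, "Cadillac", "Chevrolet"),
      difference(shifted, "Cadillac", "Chevrolet"))
print(ratio(ratings, "Cadillac", "Hyundai"),
      ratio(shifted, "Cadillac", "Hyundai"))
```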

Levels of measurement are relevant to test con- 
struction because the more powerful and useful 
parametric statistical procedures (e.g., Pearson r,
analysis of variance, multiple regression) should 
be used only for scores derived from measures that 
meet the criteria of interval or ratio scales. For 
scales that are only nominal or ordinal, less- 
powerful nonparametric statistical procedures (e.g., 
chi-square, rank order correlation, median tests) 
must be employed. In practice, most major psy- 
chological testing instruments (especially intelli- 
gence tests and personality scales) are assumed to 
employ approximately interval-level measurement 
even though, strictly speaking, it is very difficult to 
demonstrate absolute equality of intervals for such 
instruments (Bausell, 1986). Now that the reader is 
familiar with levels of measurement, we introduce 
a representative sample of scaling methods, noting 





in advance that different scaling methods yield different
levels of measurement.


REPRESENTATIVE SCALING METHODS


Expert Rankings








Suppose we wanted to measure the depth of coma 
in patients who had suffered a recent head in- 
jury that rendered them unconscious. A depth of 
coma scale could be very important in predicting 
the course of improvement, because it is well 
known that a lengthy period of unconsciousness 
offers a poor prognosis for ultimate recovery. In 
addition, rehabilitation personnel have a practical 
need to know whether a patient is deeply coma- 
tose or in a partially communicative state of twi- 
light consciousness. 

One approach to scaling the depth of coma 
would be to rely on the behavioral rankings of ex- 
perts. For example, we could ask a panel of neu- 
rologists to list patient behaviors associated with 
different levels of consciousness. After the experts 
had submitted a large list of diagnostic behaviors,
the test developers—preferably experts on head in- 
juries—could rank the indicator behaviors along a 
continuum of consciousness ranging from deep 
coma to basic orientation. Using precisely this ap- 
proach, Teasdale and Jennett (1974) produced the 
Glasgow Coma Scale. Instruments similar to this 
scale are widely used in hospitals for the assess- 
ment of traumatic brain injury (Figure 4.7). 

The Glasgow Coma Scale is scored by observ- 
ing the patient and assigning the highest level of 
functioning on each of three subscales. On each 
subscale, it is assumed that the patient displays all 
levels of behavior below the rated level. Thus, from 
a psychometric standpoint, this scale consists of 
three subscales (eyes, verbal response, and motor 
response) each yielding an ordinal ranking of 
behavior. 

                   4  Spontaneously
Eyes Open          3  To speech
                   2  To pain
                   1  None

                   5  Oriented
Best Verbal        4  Confused
Response           3  Inappropriate
                   2  Incomprehensible
                   1  None

                   5  Obey commands
Best Motor         4  Localize pain
Response           3  Flexion to pain
                   2  Extension to pain
                   1  None

FIGURE 4.7 Example of the Use of Glasgow Coma Scale
for Recording Depth of Coma

Source: Reprinted with permission from Jennett, B.,
Teasdale, G. M., & Knill-Jones, R. P. (1975). Predicting
outcome after head injury. Journal of the Royal
College of Physicians of London, 9, 231-237.

In addition to the rankings, it is possible to
compute a single overall score that is something
more than an ordinal scale, although probably less
than true interval-level measurement. If numbers
are attached to the rankings (e.g., for eyes open a
coding of “none” = 1, “to pain” = 2, and so on),
then the numbers for the rated level for each subscale
can be added, yielding a maximum possible
score of 14. The total score on the Glasgow Coma
Scale predicts later recovery with a very high degree
of accuracy (Jennett, Teasdale, & Knill-Jones,
1975). We see, then, that quite plain psychological
tests derived from the very simplest scaling methods
can, nonetheless, provide valid and useful
information.
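
The summing rule just described can be sketched in a few lines. The codings are copied from Figure 4.7; the dictionary keys and the function name are hypothetical labels chosen for illustration, and the sketch makes no claim about how the scale is recorded in clinical software.

```python
# Codings copied from Figure 4.7 (the version whose maximum total is 14).
GLASGOW_SUBSCALES = {
    "eyes_open": {
        1: "None", 2: "To pain", 3: "To speech", 4: "Spontaneously",
    },
    "best_verbal_response": {
        1: "None", 2: "Incomprehensible", 3: "Inappropriate",
        4: "Confused", 5: "Oriented",
    },
    "best_motor_response": {
        1: "None", 2: "Extension to pain", 3: "Flexion to pain",
        4: "Localize pain", 5: "Obey commands",
    },
}


def glasgow_total(rated_levels):
    """Add the rated level on each subscale to obtain the overall score."""
    total = 0
    for subscale, levels in GLASGOW_SUBSCALES.items():
        level = rated_levels[subscale]
        if level not in levels:
            raise ValueError(f"{subscale}: {level} is not a valid rating")
        total += level
    return total


# A hypothetical patient: eyes open to pain, incomprehensible speech,
# localizes pain.
patient = {"eyes_open": 2, "best_verbal_response": 2, "best_motor_response": 4}
print(glasgow_total(patient))
```

Each subscale remains an ordinal ranking; the total is simply the sum of three ranks, which is why it falls short of true interval-level measurement.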


Method of Equal-Appearing Intervals 


Early in the twentieth century, L. L. Thurstone (1929) proposed
a method for constructing interval-level
scales from attitude statements. His method of 
equal-appearing intervals is still used today, 
marking him as one of the giants of psychometric 
theory. The actual methodology of constructing 
equal-appearing intervals is somewhat complex 
and statistically laden, but the underlying logic is 
easy to explain (Ghiselli, Campbell, & Zedeck,
1981). To illustrate the method, we summarize the 
steps involved in constructing a scale of attitudes 
toward church membership: 


1. Collect as many true-false statements as possi- 
ble reflecting a variety of positive and negative 
attitudes toward the church. Two extreme ex- 
amples might be: 


“I feel that church services give me inspiration and 
help me to live up to my best during the following 
week.” 

“I think that churches seek to impose a lot of worn- 
out dogmas and medieval superstitions.” 


Of course, many moderate items would be col- 
lected as well. 

2. Next, have 10 or so known experts or judges 
rate these statements to determine the degree of 
favorability/unfavorability toward the attitude. 
The judges should be qualified for the task at 
hand; ministers might be used for a church 
membership attitude scale. Usually, each judge 
is requested to sort each statement into 1 of 11 
categories that range from “extremely favorable”
to “extremely unfavorable.” Judges are
told to disregard their own biases and to regard
the 11 categories as equidistant.

3. After the judges have completed the evaluation 
process, the mean favorability rating (from 1 to 
11) and standard deviation for each item are determined.
For example, 10 judges may have
given an average favorability rating of 9.2 to the 
first item previously noted; but ratings would 
likely differ from one judge to another, as re- 
flected in a standard deviation of 1.1 for this 
item. 

4. Because the standard deviation of an item favorability
rating reflects ambiguity, items with
large standard deviations are dropped.
Usually, about 20 to 30 items are chosen such 
that the statements cover the range of the di- 
mension (favorable to unfavorable). It is as- 
sumed that the differences between items on the 
final scale fulfill the properties of an interval 
scale. 

5. Persons who take the attitude scale are asked to 
mark all the statements with which they agree. 
Their score is determined by averaging the scale 
values of those items endorsed. 


Ghiselli et al. (1981) note that the preceding scaling
method merely produces the attitude scale. Reliability
and validity analyses of the scale are still needed
to determine its appropriateness and usefulness. 
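
Steps 3 through 5 of the method can be sketched as follows. This is a minimal illustration under stated assumptions: the judge ratings are invented, the cutoff of 1.5 for an "ambiguous" item is an arbitrary choice, and the sample standard deviation is used because the text does not say which form applies. The function and item names are hypothetical.

```python
from statistics import mean, stdev

# Invented ratings (1 to 11) from five hypothetical judges for three items.
JUDGE_RATINGS = {
    "inspiration": [9, 10, 9, 8, 10],
    "dogmas": [2, 1, 2, 3, 2],
    "ambiguous": [2, 10, 6, 1, 11],
}


def scale_items(judge_ratings, max_sd):
    """Steps 3-4: keep items the judges agree on; scale value = mean rating."""
    scale_values = {}
    for item, ratings in judge_ratings.items():
        # A large standard deviation signals an ambiguous item, so drop it.
        if stdev(ratings) <= max_sd:
            scale_values[item] = mean(ratings)
    return scale_values


def attitude_score(scale_values, endorsed_items):
    """Step 5: average the scale values of the statements a person endorses."""
    values = [scale_values[item] for item in endorsed_items if item in scale_values]
    if not values:
        raise ValueError("no retained items were endorsed")
    return mean(values)


scale_values = scale_items(JUDGE_RATINGS, max_sd=1.5)  # 1.5 is an assumed cutoff
print(scale_values)
print(attitude_score(scale_values, ["inspiration", "dogmas"]))
```

A respondent who endorses both a highly favorable and a highly unfavorable statement lands near the middle of the scale, which is the expected behavior when scale values are averaged.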

A study by Russo (1994) illustrates a modern
application of the Thurstone method. She used a 
Thurstone scaling approach to evaluate 216 items 
from three prominent self-report depression inven- 
tories. The judges included 527 undergraduates 
and 37 clinical faculty members at a medical 
school. The 216 items were randomized and rated 
with respect to depressive severity from 1 repre- 
senting no depression to 11 representing extreme 
depression. She discovered that all three self- 
report inventories lacked items and response op- 
tions typical of mild depression. The distribution 
of the 216 items was bimodal with many items 
bunched near the bottom (no depression) and many 
items bunched near the middle (moderate depres- 
sion). A characteristic finding for one set of items 
from a prominent depression scale was as follows: 


Rated        Original
Depression   Scoring    Item Content

1.0          1          I never feel downhearted or sad.
3.4          2          I sometimes feel downhearted or sad.
4.1          3          I feel downhearted or sad a good part of the time.
4.4          4          I feel downhearted or sad most of the time.


The reader will notice that the original scoring on
these items deviates substantially from the depres- 
sion ratings provided by the panel of students and 
clinical faculty. It is also evident that the actual 
scale values are discontinuous, jumping from 1.0 
to 3.4 and higher. A similar pattern was observed 
for many items on all three inventories, leading 
Russo (1994) to conclude: 


The present results suggest that if the original scor- 
ing is used for the three scales examined here, then 
the distinctions between well-being and absence of 
depression as well as between moderate and severe 
will be difficult to make. Such imprecision will 
make it difficult to assess the efficacy of treatments 
for depression, because a lack thereof must be a 
function of added measurement error due to ordinal 
measures. Such error could also wreak havoc in 
longitudinal studies, especially in those in which 
memory is involved. 


We see in this example that Thurstone’s approach 
to item scaling has powerful applications in test de- 
velopment. Based upon these findings, researchers 
are now in a position to develop improved self- 
report scales that assess the full range of sympto- 
matology in depression. 


Method of Absolute Scaling 


Thurstone (1925) also developed the method of 
absolute scaling, a procedure for obtaining a mea- 
sure of absolute item difficulty based upon results 
for different age groups of test takers. The method- 
ology for determining individual item difficulty on 
an absolute scale is quite complex, although the 
underlying rationale is not too difficult to understand.
Essentially, a set of common test items is administered
to two or more age groups. The relative
difficulty of these items between any two age 
groups serves as the basis for making a series of in- 
terlocking comparisons for all items and all age 
groups. One age group serves as the anchor group.
Item difficulty is measured in common units such
as standard deviation units of ability for the anchor
group. The method of absolute scaling is widely
used in group achievement and aptitude testing
(STEP, 1980; Donlon, 1984). 

Thurstone (1925) illustrated the method of ab- 
solute scaling with data from the testing of 3,000 
schoolchildren on the 65 questions from the original
Binet test. Using the mean of Binet test intelligence
of 3½-year-old children as the zero point
and the standard deviation of their intelligence as 
the unit of measurement, he constructed a scale 
that ranged from -2 to +10 and then located each
of the 65 questions on that scale. Thurstone (1925) 
found that the scale “brings out rather strikingly 
the fact that the questions are unduly bunched at 
certain ranges [of difficulty] and rather scarce at 
other ranges.” A modern test developer would use 
this kind of analysis as a basis for dropping redun- 
dant test items (redundant in the sense that they 
measure at the same difficulty level) and adding 
other items that test the higher (and lower) ranges 
of difficulty. 
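
The rationale of absolute scaling can be sketched for the simplest case of one anchor group and one other group. The sketch rests on assumptions that should be stated plainly: ability is taken to be normally distributed within each group, only two groups are linked, and the function names are hypothetical. It is a simplified illustration of the logic, not a reproduction of Thurstone's full procedure.

```python
from statistics import NormalDist, mean, stdev

_STD_NORMAL = NormalDist()


def sigma_values(proportions_passing):
    """Item difficulty within one group, in that group's own SD units.

    An item passed by half the group sits at the group mean (0);
    harder items receive positive values.
    """
    return [_STD_NORMAL.inv_cdf(1.0 - p) for p in proportions_passing]


def link_to_anchor(p_anchor, p_other):
    """Locate the other group's mean and SD on the anchor group's scale.

    Both arguments list the proportion passing the same common items.
    If a common item has one absolute difficulty d, then
    d = x_anchor and d = mean_other + sd_other * x_other.
    """
    x_anchor = sigma_values(p_anchor)
    x_other = sigma_values(p_other)
    sd_other = stdev(x_anchor) / stdev(x_other)
    mean_other = mean(x_anchor) - sd_other * mean(x_other)
    return mean_other, sd_other


def absolute_difficulty(p_other, mean_other, sd_other):
    """Express item difficulties seen in the other group in anchor units."""
    return [mean_other + sd_other * x for x in sigma_values(p_other)]
```

With the anchor group's mean fixed at zero and its standard deviation at one, any item administered to the other group can then be placed on the common scale, including items the anchor group never attempted.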


Likert Scales 


Likert (1932) proposed a simple and straightfor- 
ward method for scaling attitudes that is widely 
used today. A Likert scale presents the examinee 
with five responses ordered on an agree/disagree or 
approve/disapprove continuum. For example, one 
item on a scale to assess attitudes toward church 
membership might read: 


Church services give me inspiration and help me to live 
up to my best during the following week. 
Do you: 


      |            |            |            |            |
   Strongly      Agree      Undecided    Disagree     Strongly
    Agree                                             Disagree

Depending on the wording of an individual item,
an extreme answer of “strongly agree” or “strongly
disagree” will indicate the most favorable response
on the underlying attitude measured by the questionnaire.
Likert (1932) assigned a score of 5 to this
extreme response, 1 to the opposite extreme, and 2,
3, and 4 to intermediate replies. The total scale 
score is obtained by adding the scores from indi- 
vidual items. For this reason, a Likert scale is also 
referred to as a summative scale. 
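
Summative scoring, including the reversal needed when “strongly disagree” is the favorable answer, can be sketched as follows. The raw coding (5 = strongly agree through 1 = strongly disagree), the item labels, and the function names are assumptions made for illustration.

```python
def likert_item_score(response, reverse_keyed=False, n_options=5):
    """Score one item so that the most favorable answer always earns the maximum.

    `response` is the raw choice, assumed coded 5 = strongly agree
    down to 1 = strongly disagree.
    """
    if not 1 <= response <= n_options:
        raise ValueError(f"response must be between 1 and {n_options}")
    # For items worded against the attitude, disagreement is the favorable end.
    return (n_options + 1 - response) if reverse_keyed else response


def likert_total(responses, reverse_keyed_items=frozenset()):
    """Total scale score: the sum of the individual item scores."""
    return sum(
        likert_item_score(raw, item in reverse_keyed_items)
        for item, raw in responses.items()
    )


# A hypothetical respondent who strongly agrees with a favorable statement
# and strongly disagrees with an unfavorable one.
responses = {"services_inspire_me": 5, "churches_impose_dogmas": 1}
print(likert_total(responses, reverse_keyed_items={"churches_impose_dogmas"}))
```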


Guttman Scales 


On a Guttman scale, respondents who endorse one 
statement also agree with milder statements pertinent 
to the same underlying continuum (Guttman, 1944, 
1947). Thus, if the examiner knows an examinee’s 
most extreme endorsement on the continuum, it is 
possible to reconstruct the intermediate responses as 
well. Guttman scales are produced by selecting items 
that fall into an ordered sequence of examinee en- 
dorsement. A perfect Guttman scale is seldom 
achieved because of errors of measurement, but is 
nonetheless a fitting goal for certain types of tests. 
Although the Guttman approach was originally 
devised to determine whether a set of attitude state- 
ments is unidimensional, the technique has been 
used in many different kinds of tests. For example, 
Beck used Guttman-type scaling to produce the in- 
dividual items of the Beck Depression Inventory 
(BDI; Beck, Steer, & Garbin, 1988; Beck et al.,
1961). Items from the BDI resemble the following: 


( ) I occasionally feel sad or blue.
( ) I often feel sad or blue.
( ) I feel sad or blue most of the time.
( ) I always feel sad and I can’t stand it.


Clients are asked to “check the statement from 
each group that you feel is most true about you.” 
A client who endorses an extreme alternative (e.g., 
“I always feel sad and I can’t stand it”) almost 
certainly agrees with the milder statements as well. 
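
The defining property of a Guttman scale lends itself to a short sketch. Statements are assumed to be listed from mildest to most extreme, and endorsements are stored as true/false values; the function names and the zero-based index are illustrative choices.

```python
def is_perfect_guttman(pattern):
    """Return True if no endorsed statement follows a rejected milder one.

    `pattern` lists endorsements ordered from mildest to most extreme.
    """
    seen_rejection = False
    for endorsed in pattern:
        if endorsed and seen_rejection:
            return False
        if not endorsed:
            seen_rejection = True
    return True


def reconstruct_pattern(most_extreme_index, n_items):
    """Rebuild the full response pattern from the most extreme endorsement.

    `most_extreme_index` is zero-based; use -1 if nothing was endorsed.
    """
    return [i <= most_extreme_index for i in range(n_items)]


print(is_perfect_guttman([True, True, False, False]))
print(is_perfect_guttman([True, False, True, False]))
print(reconstruct_pattern(2, 4))
```

Response patterns that fail the first check are the errors of measurement that keep real instruments from forming a perfect Guttman scale.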


Method of Empirical Keying 


The reader may have noticed that most of the scaling
methods discussed in the preceding section rely
upon the authoritative judgment of experts in the
selection and ordering of items. It is also possible 
to construct measurement scales based entirely on 
empirical considerations devoid of theory or expert 
judgment. In the method of empirical keying, test 
items are selected for a scale based entirely on how 
well they contrast a criterion group from a norma- 
tive sample. For example, a Depression scale could 
be derived from a pool of true-false personality in- 
ventory questions in the following manner: 


1. A carefully selected and homogeneous group of
persons experiencing major depression is gath- 
ered to answer the pool of true-false questions. 

2. For each item, the endorsement frequency of 
the depression group is compared to the en- 
dorsement frequency of the normative sample. 

3. Items which show a large difference in endorse- 
ment frequency between the depression and 
normative samples are selected for the Depres- 
sion scale, keyed in the direction favored by de- 
pression subjects (true or false, as appropriate). 

4. Raw score on the Depression scale is then sim- 
ply the number of items answered in the keyed 
direction. 
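
The four steps above can be sketched with a tiny invented data set. Everything specific here is an assumption for illustration: the answers, the item labels, and the minimum difference of .30 in endorsement frequency (the text says only that the difference should be large).

```python
def endorsement_frequency(group, item):
    """Proportion of a group answering 'true' to an item."""
    return sum(person[item] for person in group) / len(group)


def build_key(criterion_group, normative_sample, min_difference):
    """Steps 2-3: select contrasting items and key them toward the criterion group."""
    key = {}
    for item in criterion_group[0]:
        diff = (endorsement_frequency(criterion_group, item)
                - endorsement_frequency(normative_sample, item))
        if abs(diff) >= min_difference:
            # True means the item is keyed 'true'; False means keyed 'false'.
            key[item] = diff > 0
    return key


def raw_score(answers, key):
    """Step 4: count the items answered in the keyed direction."""
    return sum(1 for item, keyed in key.items() if answers[item] == keyed)


# Invented true-false answers (True = 'true') for two very small groups.
depression_group = [
    {"q1": True, "q2": False, "q3": True},
    {"q1": True, "q2": False, "q3": False},
]
normative_sample = [
    {"q1": False, "q2": True, "q3": True},
    {"q1": False, "q2": True, "q3": False},
]

key = build_key(depression_group, normative_sample, min_difference=0.30)
print(key)
print(raw_score({"q1": True, "q2": True, "q3": True}, key))
```

Note that nothing in the procedure inspects item content, which is how an item with no obvious connection to the construct can end up on the scale.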


The method of empirical keying can produce 
some interesting surprises. A common finding is 
that some items selected for a scale may show no 
obvious relationship to the construct measured. For 
example, an item such as “I drink a lot of water” 
(keyed true) might end up on a Depression scale. 
The momentary rationale for including this item is 
simply that it works. Of course, the challenge 
posed to researchers is to determine why the item 
works. However, from the practical standpoint of 
empirical scale construction, theoretical consider- 
ations are of secondary importance. We discuss the 
method of empirical keying further in Topic 14A, 
Self-Report Inventories. 


Rational Scale Construction 
(Internal Consistency) 


The rational approach to scale construction is a 
popular method for the development of self-report 
personality inventories. The name rational is somewhat
of a misnomer, insofar as certain statistical
methods are essential to this approach. Also, the
name implies that other approaches are nonrational 
or irrational, which is untrue. The heart of the 
method of rational scaling is that all scale items 
correlate positively with each other and also with 
the total score for the scale. An alternative and 
more appropriate name for this approach is internal 
consistency, which emphasizes what is actually 
done. Gough and Bradley (1992) explain how the 
rational approach earned its descriptive title: 


The idea of rationality enters the scene in that
the central theme or unifying dimension around
which the items cluster is one that was conceptually 
articulated beforehand by the developer of the mea- 
sure and from which the scoring of each item is de- 
termined in a logical and understandable way. 


We will follow their presentation to illustrate the 
features of the rational approach. 

Suppose a test developer desires to develop a 
new self-report scale for leadership potential. 
Based upon a review of relevant literature, the re- 
searcher might conclude that leadership potential 
is characterized by self-confidence, resilience 
under pressure, high intelligence, persuasiveness, 
assertiveness, and the ability to sense what others 
are thinking and feeling. These notions suggest 
that the following true-false items might be useful 
in the assessment of leadership potential (Gough & 
Bradley, 1992): 


• I generally feel sure of myself and self-confident.
  (T)
• When others disagree with me, I usually just
  keep quiet or else give in. (F)
• I believe that I am distinctly above average in
  intellectual ability. (T)
• I often feel that I have a poor understanding of
  how other people will react to things. (F)
• My friends would probably describe me as a
  strong, forceful person. (T)


The T and F after each statement indicate the ra- 
tionally keyed direction for leadership potential. 
Of course, additional items with similar inten- 
tions also would be proposed. The test developer 
might begin with 100 items that appear—on a ratio- 
nal basis—to assess leadership potential. These pre- 
liminary items would be administered to a large 
sample of individuals similar to the target population
for whom the scale is intended. For instance, if
the scale is designed to identify college students 
with leadership potential, then it should be adminis- 
tered to a cross section of several hundred college 
students. For scale development, very large samples 
are desirable. In this hypothetical case, let us as- 
sume that we obtain results for 500 college students. 

The next step in rational scale construction is to 
correlate scores on each of the preliminary items 
with the total score on the test for the 500 subjects 
in the tryout sample. Because scores on the items 
are dichotomous (1 is arbitrarily assigned to an answer
corresponding to the scoring key, 0 to the alternative),
a biserial correlation coefficient r_bi is
needed. Once the correlations are obtained, the re- 
searcher scans the list in search of weak correla- 
tions and reversals (negative correlations). These 
items are discarded because they do not contribute 
to the measurement of leadership potential. Up to 
half of the initial items might be discarded. If a 
large proportion of items is initially discarded, the 
researcher might recalculate the item-total correla- 
tions based upon the reduced item pool to verify 
the homogeneity of the remaining items. The items 
that survive this iterative procedure constitute the 
leadership potential scale. The reader should keep 
in mind that the rational approach to scale con- 
struction merely produces a homogeneous scale 
thought to measure a specified construct. Addi- 
tional studies with new subject samples would be 
needed to determine the reliability and validity of 
the new scale. 
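
The iterative item-total procedure can be sketched as follows. Two substitutions should be named plainly: the sketch computes an ordinary Pearson correlation between each 0/1 item and the total (that is, a point-biserial coefficient) rather than the biserial coefficient mentioned above, and each item is left in its own total rather than removed from it. The cutoff of .20 and the tiny data set are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores, items):
    """Correlate each keyed item (1 or 0) with the total over `items`."""
    totals = [sum(person[i] for i in items) for person in scores]
    return {i: pearson([person[i] for person in scores], totals) for i in items}


def purify_scale(scores, min_r):
    """Drop weak or reversed items, then recompute until nothing more is dropped."""
    items = list(scores[0])
    while True:
        correlations = item_total_correlations(scores, items)
        kept = [i for i in items if correlations[i] >= min_r]
        if len(kept) == len(items):
            return kept
        items = kept


# Invented tryout data: 1 = answered in the rationally keyed direction.
tryout = [
    {"a": 1, "b": 1, "c": 1, "d": 0},
    {"a": 1, "b": 1, "c": 1, "d": 0},
    {"a": 1, "b": 1, "c": 0, "d": 0},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 1, "c": 0, "d": 1},
]
print(purify_scale(tryout, min_r=0.20))
```

In this invented sample, item d correlates negatively with the total (a reversal) and is discarded on the first pass; the remaining items survive the recalculation.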


CONSTRUCTING THE ITEMS


Constructing test items is a painful and laborious 
procedure that taxes the creativity of test develop- 
ers. The item writer is confronted with a profusion 
of initial questions: 


• Should item content be homogeneous or varied?
• What range of difficulty should the items cover?
• How many initial items should be constructed?
• Which cognitive processes and item domains
  should be tapped?
• What kind of test item should be used?

We will address the first three questions briefly before
turning to a more detailed discussion of the
last two topics, which are commonly referred to
under the rubrics of table of specifications and item 
formats. 


Initial Questions in Test Construction 


The first question pertains to the homogeneity ver- 
sus heterogeneity of test item content. In large 
measure, whether item content is homogeneous or 
varied is dictated by the manner in which the test 
developer has defined the new instrument. Con- 
sider a culture-reduced test of general intelligence. 
Such an instrument might incorporate varied 
items, so long as the questions do not presume spe- 
cific schooling. The test developer might seek to 
incorporate novel problems equally unfamiliar to 
all examinees. On the other hand, with a theory- 
based test of spatial thinking, subscales with ho- 
mogeneous item content would be required. 

The range of item difficulty must be sufficient 
to allow for meaningful differentiation of exami- 
nees at both extremes. The most useful tests, then, 
are those that include a graded series of very easy
items passed by almost everyone as well as a group 
of incrementally more difficult items passed by 
virtually no one. A ceiling effect is observed when 
significant numbers of examinees obtain perfect or 
near-perfect scores. The problem with a ceiling 
effect is that distinctions between high-scoring 
examinees are not possible, even though these ex- 
aminees might differ substantially on the under- 
lying trait measured by the test. A floor effect is 
observed when significant numbers of examinees 
obtain scores at or near the bottom of the scale. 
For example, the WAIS-R has a serious floor effect 
in that it fails to discriminate between moderate, 
severe, and profound levels of mental retarda- 
tion—all persons with significant developmental 
disabilities fail to answer virtually every question. 
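
A simple screen for ceiling and floor effects is to count how many examinees score at or near the extremes. The sketch below is only illustrative: the text does not define how many examinees count as "significant," so the function merely reports the proportions, and the margin of one point and the scores are invented.

```python
def extreme_score_proportions(total_scores, min_score, max_score, margin=0):
    """Return (proportion at or near the floor, proportion at or near the ceiling)."""
    n = len(total_scores)
    at_floor = sum(1 for s in total_scores if s <= min_score + margin) / n
    at_ceiling = sum(1 for s in total_scores if s >= max_score - margin) / n
    return at_floor, at_ceiling


# Invented scores on a hypothetical 20-item test that is too easy for this group.
scores = [20, 19, 20, 12, 20, 18, 20, 20]
print(extreme_score_proportions(scores, min_score=0, max_score=20, margin=1))
```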

Test developers expect that some initial items 
will prove to make ineffectual contributions to the 
overall measurement goal of their instrument. For 
this reason, it is common practice to construct a 
first draft that contains excess items, perhaps dou- 


ble the number of questions desired on the final 
draft. For example, the 550-item MMPI originally 
consisted of more than 1,000 true-false personality 
statements (Hathaway & McKinley, 1940). 


Table of Specifications 


Professional developers of achievement and ability 
tests often use one or more item-writing schemes to 
help ensure that their instrument taps a desired 
mixture of cognitive processes and content do- 
mains. For example, a very simple item-writing 
scheme might designate that an achievement test 
on the Civil War should consist of 10 multiple- 
choice items and 10 fill-in-the-blank questions, 
half of each on factual matters (e.g., dates, major 
battles) and the other half on conceptual issues 
(e.g., differing views on slavery). 

Before development of a test begins, item writ- 
ers usually receive a table of specifications. A 
table of specifications enumerates the information 
and cognitive tasks on which examinees are to be 
assessed. Perhaps the most common specification 
table is the content-by-process matrix, which lists 
the exact number of items in relevant content areas 
and details the precise composite of items that 
must exemplify different cognitive processes 
(Millman & Greene, 1989). 

Consider a science achievement test suitable 
for high school students. Such a test must cover 
many different content areas and should require a 
mixture of cognitive processes ranging from sim- 
ple recall to inferential reasoning. By providing a 
table of specifications prior to the item-writing 
stage, the test developer can guarantee that the re- 
sulting instrument contains a proper balance of 
topical coverage and taps a desired range of cogni- 
tive skills. A hypothetical but realistic table of 
specifications is portrayed in Table 4.6. 
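
A content-by-process matrix such as Table 4.6 can be held in a small data structure so that an item pool can be checked against it. The sketch below uses the cell counts from the table; the short process labels and the function names are hypothetical.

```python
# Cell counts from Table 4.6 (hypothetical 100-item science achievement test).
TABLE_OF_SPECIFICATIONS = {
    "Astronomy": {"factual": 8, "information": 3, "inferential": 3},
    "Botany": {"factual": 6, "information": 7, "inferential": 2},
    "Chemistry": {"factual": 10, "information": 5, "inferential": 4},
    "Geology": {"factual": 10, "information": 5, "inferential": 2},
    "Physics": {"factual": 8, "information": 5, "inferential": 6},
    "Zoology": {"factual": 8, "information": 5, "inferential": 3},
}


def process_totals(table):
    """Number of items required for each cognitive process (column totals)."""
    totals = {}
    for cells in table.values():
        for process, count in cells.items():
            totals[process] = totals.get(process, 0) + count
    return totals


def shortfalls(table, written):
    """Cells where fewer items have been written than the table requires.

    `written` maps (content area, process) pairs to the number of drafted items.
    """
    return {
        (area, process): required - written.get((area, process), 0)
        for area, cells in table.items()
        for process, required in cells.items()
        if written.get((area, process), 0) < required
    }


print(process_totals(TABLE_OF_SPECIFICATIONS))
```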


TABLE 4.6 Example of a Content-by-Process
Table of Specifications for a Hypothetical 100-Item
Science Achievement Test

                            Process
Content       Factual        Information     Inferential
Area          Knowledge(a)   Competence(b)   Reasoning(c)
Astronomy       8              3               3
Botany          6              7               2
Chemistry      10              5               4
Geology        10              5               2
Physics         8              5               6
Zoology         8              5               3
Totals         50             30              20

(a) Factual Knowledge: Items can be answered based on simple
recognition of basic facts.
(b) Information Competence: Items require usage of information
provided in written text.
(c) Inferential Reasoning: Items can be answered by making
deductions or drawing conclusions.


Item Formats


When it comes to the method by which psychological
attributes are to be assessed, the test developer
is confronted with dozens of choices. Indeed,
it would be easy to write an entire chapter on this
topic alone. For reviews of item formats, the interested
reader should consult Bausell (1986), Jensen
(1980), and Wesman (1971). In this section, we will
quickly survey the advantages and pitfalls of the
more common varieties of test items.

For group-administered tests of intellect or 
achievement, the technique of choice is the multiple-
choice question. For example, an item on an American
history achievement test might include this
combination of stem and options: 


The president of the United States during the Civil War was

a. Washington
b. Lincoln
c. Hamilton
d. Wilson


Proponents of multiple-choice methodology argue 
that properly constructed items can measure concep- 
tual as well as factual knowledge. Multiple-choice 
tests also permit quick and objective machine scor- 
ing. Furthermore, the fairness of multiple-choice 
questions can be proved (or occasionally dis- 
proved!) with very simple item analysis procedures 
discussed subsequently. The major shortcomings of 
multiple-choice questions are, first, the difficulty of
writing good distractor options and, second, the
possibility that the presence of the response may
cue a half-knowledgeable respondent to the correct 
answer. Guidelines for writing good multiple- 
choice items are listed in Table 4.7. 

Matching questions are popular in classroom 
testing, but suffer serious psychometric shortcom- 
ings. An example of a matching question: 


Using the letters on the left, match the name to the 
accomplishment: 


A. Binet        _____ translated a major intelligence test
B. Woodworth    _____ no correlation between grades and mental tests
C. Cattell      _____ developed true/false personality inventory
D. McKinley     _____ battery of sensorimotor tests
E. Wissler      _____ developed first useful intelligence test
F. Goddard      _____ screening test for emotional disturbance


The most serious problem with matching questions 
is that responses are not independent—missing 
one match usually compels the examinee to miss 
another. Another problem is that the options in a
matching question must be very closely related or 
the question will be too easy. 


TABLE 4.7 Guidelines for Writing Multiple- 
Choice Items 





Choose words that have precise meanings.
Avoid complex or awkward word arrangements.
Include all information needed for response selection.
Put as much of the question as possible in the stem.
Do not take stems verbatim from textbooks.
Use options of equal length and parallel phrasing.
Use “none of the above” and “all of the above” rarely.
Minimize the use of negatives such as not.
Avoid the use of nonfunctional words.
Avoid unessential specificity in the stem.
Avoid unnecessary clues to the correct response.
Submit items to others for editorial scrutiny.


For individually administered tests, the procedure
of choice is the short-answer objective item.
Indeed, the simplest and most straightforward 
types of questions often possess the best reliability 
and validity. A case in point is the Vocabulary sub- 
test from the WAIS-III, which consists merely of 
asking the examinee to define words. This subtest 
has very high reliability (.96) and is usually con- 
sidered the single best measure of overall intelli- 
gence on the WAIS-III (Gregory, 1999). 

Personality tests often use true-false questions 
because they are easy for subjects to understand. 
Most people find it simple to answer true or false 
to items such as: 


T   F   I like sports magazines.


Critics of this approach have pointed out that an- 
swers to such questions may reflect social desir- 
ability rather than personality traits (Edwards, 
1961). An alternative format designed to counter- 
act this problem is the forced-choice methodol- 
ogy in which the examinee must choose between 
two equally desirable (or undesirable) options: 


Which would you rather do: 
Mop a gallon of syrup from the floor. 
Volunteer for a half day at a nursing home. 


Although the forced-choice approach has many de- 
sirable psychometric properties (Zavala, 1965), 
personality test developers have not rushed to em- 
brace this interesting methodology. 


TESTING THE ITEMS


Psychometricians expect that numerous test items 
from the original tryout pool will be discarded or 
revised as test development proceeds. For this rea- 
son, test developers initially produce many, many 
excess items, perhaps double the number of items 
they intend to use. So, how is the final sample of 
test questions selected from the initial item pool? 
Test developers use item analysis, a family of statistical
procedures, to identify the best items. In
general, the purpose of item analysis is to determine
which items should be retained, which revised,
and which thrown out. In conducting a
thorough item analysis, the test developer might 
make use of item-difficulty index, item-reliability 
index, item-validity index, item-characteristic curve, 
and an index of item discrimination. We turn now 
to a brief review of these statistical approaches to 
item analysis. Readers who wish an in-depth dis- 
cussion and critique of these topics should consult 
Hambleton (1989) and Nunnally (1978). 


Item-Difficulty Index 


The item difficulty for a single test item is defined 
as the proportion of examinees in a large tryout 
sample who get that item correct. For any individual
item i, the index of item difficulty is $p_i$, which
varies from 0.0 to 1.0. An item with difficulty of .2 
is more difficult than an item with difficulty of .7, 
because fewer examinees answered it correctly. 

The item-difficulty index is a useful tool for 
identifying items that should be altered or dis- 
carded. Suppose an item has a difficulty index near 
0.0, meaning that nearly everyone has answered it 
incorrectly. Unfortunately, this item is psychomet- 
rically unproductive because it does not provide in- 
formation about differences between examinees. 
For most applications, the item should be rewritten 
or thrown out. The same can be said for an item 
with difficulty index near 1.0, where virtually all 
subjects provide a correct answer. 
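
As a minimal sketch of the computation just described, assume that tryout responses are stored as rows of 0/1 item scores, one row per examinee. The function names, the tiny data set, and the .1 and .9 flagging thresholds are illustrative assumptions, not values prescribed in the text.

```python
def item_difficulty(responses):
    """Return p_i for each item: the proportion of examinees who got it correct."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]


def flag_unproductive(p_values, low=0.1, high=0.9):
    """Return indices of items that nearly everyone fails or nearly everyone passes.

    The cutoffs are hypothetical; the text says only "near 0.0" and "near 1.0".
    """
    return [i for i, p in enumerate(p_values) if p <= low or p >= high]


# Hypothetical tryout data: 5 examinees (rows) by 4 items (columns).
responses = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
p_values = item_difficulty(responses)
flagged = flag_unproductive(p_values)
```

In this toy sample the first item is passed by everyone, so it is flagged as providing no information about differences between examinees. A real tryout sample would, of course, be far larger.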

What is the optimal level of item difficulty? 
Generally, item difficulties that hover around .5, 
ranging between .3 and .7, maximize the informa- 
tion the test provides about differences between 
examinees. However, this rule of thumb is subject 
to one important qualification and one very signif- 
icant exception. 

For true-false or multiple-choice items, the op- 
timal level of item difficulty needs to be adjusted 
for the effects of guessing. For a true-false test, a 
difficulty level of .5 can result when examinees 
merely guess. Thus, the optimal item difficulty for 
such items would be .75 (halfway between .5 and 
1.0). In general, the optimal level of item difficulty
can be computed from the formula (1.0 + g)/2,
where g is the chance success level. Thus, for a 
four-option multiple-choice item, the chance suc- 
cess level is .25, and the optimal level of item dif- 
ficulty would be (1.0 + .25)/2, or about .63. 
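
The guessing adjustment can be checked with a few lines of code. This sketch simply evaluates the formula (1.0 + g)/2 from the text; the function name is an assumption made for illustration.

```python
def optimal_difficulty(chance_success):
    """Return (1.0 + g) / 2, where g is the chance success level for the item."""
    return (1.0 + chance_success) / 2.0


true_false = optimal_difficulty(1 / 2)    # 0.75
four_option = optimal_difficulty(1 / 4)   # 0.625, which the text rounds to .63
```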

If a test is to be used for selection of an extreme 
group by means of a cutting score, it may be desir- 
able to select items with difficulty levels outside 
the .3 to .7 range. For example, a test used to select 
graduate students for a university that admits only 
a select few of its many applicants should contain 
many very difficult items. A test used to designate
children for a remedial-education program should 
contain many extremely easy items. In both cases, 
there will be useful discrimination among exami- 
nees near the cutting score—a very high score for 
the graduate admissions and a very low score for 
students eligible for remediation—but little dis- 
crimination among the remaining examinees 
(Allen & Yen, 1979). 


Item-Reliability Index 


A test developer may desire an instrument with a 
high level of internal consistency in which the items 
are reasonably homogeneous. A simple way to de- 
termine whether an individual item “hangs to- 
gether” with the remaining test items is to correlate 
scores on that item with scores on the total test. 
However, individual items are typically right or 
wrong (often scored 1 or 0), whereas total scores 
constitute a continuous variable. In order to correlate 
these two different kinds of scores it is necessary 
to use a special type of statistic called the point- 
biserial correlation coefficient. The computational 
formula for this correlation coefficient is equivalent 
to the Pearson r discussed earlier, and the point- 
biserial coefficient conveys much the same kind of 
information regarding the relationship between two 
variables (one of which happens to be dichotomous 
and scored 0 or 1). In general, the higher the
point-biserial correlation $r_{iT}$ between an individual item
and the total score, the more useful is the item from 
the standpoint of internal consistency. 
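
The equivalence between the point-biserial coefficient and the Pearson r can be illustrated with a short sketch. The data below are hypothetical, and the second function uses the familiar mean-difference form of the point-biserial, which the text does not spell out; it is included only to show that both routes give the same value.

```python
import math


def pearson_r(x, y):
    """Ordinary Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(item, total):
    """Mean-difference form: (M1 - M0) / s_total * sqrt(p * (1 - p)).

    s_total is the standard deviation of total scores computed with n in the
    denominator; M1 and M0 are mean totals for those passing and failing the item.
    """
    n = len(item)
    p = sum(item) / n
    passed = [t for i, t in zip(item, total) if i == 1]
    failed = [t for i, t in zip(item, total) if i == 0]
    m1, m0 = sum(passed) / len(passed), sum(failed) / len(failed)
    mean_total = sum(total) / n
    s_total = math.sqrt(sum((t - mean_total) ** 2 for t in total) / n)
    return (m1 - m0) / s_total * math.sqrt(p * (1.0 - p))


# Hypothetical scores for 8 examinees on one item (0/1) and on the total test.
item = [1, 1, 1, 0, 0, 1, 0, 0]
total = [9, 8, 7, 4, 3, 6, 5, 2]
r_it = pearson_r(item, total)
r_it_check = point_biserial(item, total)
```

Because examinees who pass this item also tend to have higher total scores, the coefficient is positive, which is the pattern desired from the standpoint of internal consistency.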

The usefulness of an individual dichotomous 
test item is also determined by the extent to which 
scores on it are distributed between the two outcomes
of 0 and 1. Although it sounds incongruous,
it is possible to compute the standard deviation for 
dichotomous items; as with a continuously scored 
variable, the standard deviation of a dichotomous 
item indicates the extent of dispersion of the 
scores. If an individual item has a standard devia- 
tion of zero, everyone is obtaining the same score 
(all right or all wrong). The more closely the item 
approaches a 50-50 split of right and wrong scores, 
the greater is its standard deviation. In general, the 
greater the standard deviation of an item, the more 
useful is the item to the overall scale. Although we 
will not provide the derivation, it can be shown that 
the item-score standard deviation $s_i$ for a dichotomously
scored item can be computed from


$$s_i = \sqrt{p_i(1 - p_i)}$$


We may summarize the discussion up to this 
point as follows: The potential value of a dichoto- 
mously scored test item depends jointly upon its 
internal consistency as indexed by the correlation 
with the total score ($r_{iT}$) and also its variability as
indexed by the standard deviation ($s_i$). If we compute
the product of these two indices, we obtain
$s_i r_{iT}$, which is the item-reliability index. Consider
the characteristics of an item that possesses a rela- 
tively large item-reliability index. Such an item 
must exhibit strong internal consistency and pro- 
duce a good dispersion of scores between its two 
alternatives. The value of this index in test con- 
struction is simply this: By computing the item- 
reliability index for every item in the preliminary 
test, we can eliminate the “outlier” items that have 
the lowest value on this index. Such items would 
possess poor internal consistency or weak disper- 
sion of scores and therefore not contribute to the 
goals of measurement. 
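
A brief sketch may make the screening step concrete. It assumes that the difficulty p and the item-total correlation have already been computed for each item; the item names and values in the pool are hypothetical.

```python
import math


def item_reliability_index(p, r_it):
    """Return s_i * r_iT, where s_i = sqrt(p * (1 - p)) for a 0/1 item."""
    s_i = math.sqrt(p * (1.0 - p))
    return s_i * r_it


# Hypothetical preliminary pool: item name -> (difficulty p, item-total correlation).
pool = {
    "item_1": (0.50, 0.45),   # good dispersion, good consistency
    "item_2": (0.95, 0.40),   # good consistency, but almost everyone passes
    "item_3": (0.60, 0.05),   # good dispersion, but barely related to the total
}
indices = {name: item_reliability_index(p, r) for name, (p, r) in pool.items()}
weakest = min(indices, key=indices.get)
```

The index penalizes an item for either weakness: the second item loses value through its small standard deviation, and the third through its weak correlation with the total score.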


Item-Validity Index 


For many applications, it is important that a test 
possess the highest possible concurrent or predic- 
tive validity. In these cases, one overriding ques- 
tion governs test construction: How well does each 
preliminary test item contribute to accurate prediction
of the criterion? The item-validity index is a
useful tool in the psychometrician’s quest to iden- 
tify predictively useful test items. By computing 
the item-validity index for every item in the pre- 
liminary test, the test developer can identify in- 
effectual items, eliminate or rewrite them, and 
produce a revised instrument with greater practical 
utility. 

The first step in figuring an item-validity index 
is to compute the point-biserial correlation be- 
tween the item score and the score on the criterion 
variable. In general, the higher the point-biserial 
correlation $r_{iC}$ between scores on an individual
item and the criterion score, the more useful is the
item from the standpoint of predictive validity. As
previously noted, the utility of an item also depends
upon its standard deviation $s_i$. Thus, the
item-validity index consists of the product of the
standard deviation and the point-biserial correlation:
$s_i r_{iC}$.
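
The two steps can be sketched directly from raw data. In this illustration the item scores and the criterion values (imagined here as grade point averages) are hypothetical, and the function names are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_validity_index(item_scores, criterion_scores):
    """Return s_i * r_iC for one dichotomously scored item."""
    p = sum(item_scores) / len(item_scores)
    s_i = math.sqrt(p * (1.0 - p))
    r_ic = pearson_r(item_scores, criterion_scores)
    return s_i * r_ic


# Hypothetical data for 8 examinees: item score (0/1) and criterion score.
item = [1, 0, 1, 1, 0, 0, 1, 0]
criterion = [3.6, 2.4, 3.1, 3.8, 2.9, 2.2, 3.3, 2.7]
validity_index = item_validity_index(item, criterion)
```

Since the standard deviation of a dichotomous item cannot exceed .5, the index is bounded by -.5 and +.5; values near zero mark items that contribute little to prediction of the criterion.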


Item-Characteristic Curves 


An item-characteristic curve (ICC) is a graphical 
display of the relationship between the probability 
of a correct response and the examinee’s position 
on the underlying trait measured by the test. How- 
ever, we do not have direct access to underlying 
traits, so observed test scores must be used to esti- 
mate trait quantities. 

A separate ICC is graphed for each item, based 
upon a plot of the total test scores on the horizon- 
tal axis versus the proportion of examinees passing 
the item on the vertical axis (Figure 4.8). An ICC 
is actually a mathematical idealization of the rela- 
tionship between the probability of a correct re- 
sponse and the amount of the trait possessed by test 
respondents. Different ICC models use different 
mathematical functions based upon initial assump- 
tions. The simplest ICC model is the Rasch Model, 
based upon the item-response theory of the Danish 
mathematician Georg Rasch (1966). The Rasch 
Model is the simplest model because it makes just 
two assumptions: (1) test items are unidimensional
and measure one common trait, and (2) test items
vary upon a continuum of difficulty level. 

In general, a good item has a positive ICC 
slope. If the ability to solve a particular item is nor- 
mally distributed, the ICC will resemble a normal 
ogive (curve a in Figure 4.8). The normal ogive is 
simply the normal distribution graphed in cumula- 
tive form. 
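
For readers who prefer to see the curves as functions, the following sketch evaluates two idealized ICCs. The logistic expression is the form in which the Rasch model is commonly written; it is not given in the text, so it should be read as an assumption of this illustration, as should the parameter names.

```python
import math


def rasch_probability(theta, difficulty):
    """Logistic ICC: probability of a correct response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def normal_ogive(theta, difficulty, spread=1.0):
    """Cumulative normal ICC, i.e., the normal distribution in cumulative form."""
    z = (theta - difficulty) / spread
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Trait levels from low to high for an item of average difficulty (0.0).
trait_levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
logistic_curve = [rasch_probability(t, 0.0) for t in trait_levels]
ogive_curve = [normal_ogive(t, 0.0) for t in trait_levels]
```

Both curves rise steadily with trait level and pass through .5 where the trait level equals the item difficulty, which is the positive slope expected of a good item.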

The desired shape of the ICC depends upon the 
purpose of the test. Psychometric purists would 
prefer that test item ICCs approximate the normal 
ogive, because this curve is convenient for making 
mathematical deductions about the underlying trait 
(Lord & Novick, 1968). However, for selection de- 
cisions based on cutoff scores, a step function is 
preferred. For example, when combined with other 
similar items, the item that produced curve b in 
Figure 4.8 would be the best for selecting exami- 
nees with high levels of the measured trait. 

ICCs are especially useful for identifying items 
that perform differently for subgroups of exami- 
nees (Allen & Yen, 1979). For example, a test 
developer may discover that an item performs dif- 
ferently for men and women. A sex-biased ques- 
tion involving football facts comes to mind here. 
For men, the ICC for this item might have the de- 
sired positive slope, whereas for women the ICC 
might be quite flat (such as curve c in Figure 4.8). 
Items with ICCs that differ among subgroups of 
examinees can be revised or eliminated. 

FIGURE 4.8 Some Sample Item-Characteristic Curves


The underlying theory of ICC is also known as 
item response theory and latent trait theory. The 
usefulness of this approach has been questioned by 
Nunnally (1978), who points out that the assump- 
tion of test unidimensionality (implied in the ICC 
curve, which plots percentage passing against the 
unidimensional horizontal axis of trait value) is vi- 
olated when many psychological tests are consid- 
ered. If there were no serious technical and 
practical problems involved, “one wonders why 
ICC theory was not adopted long ago for the actual 
construction and scoring of tests” (Nunnally, 
1978). 

The merits of the ICC approach are still de- 
bated. ICC theory seems particularly appropriate 
for certain forms of computerized adaptive testing 
(CAT) in which each test taker responds to an indi- 
vidualized and unique set of items that are then 
scored on an underlying uniform scale (Weiss, 
1983). The CAT approach to assessment would not 
be possible in the absence of an ICC approach to 
measurement. CAT is discussed in Topic 15A, 
Computerized Assessment and the Future of Test- 
ing. Readers who wish a more detailed discussion 
of ICC and other latent trait models should consult 
Anastasi (1988), Hambleton (1989), and Wright 
and Stone (1979). 


Item-Discrimination Index 


It should be clear from the discussion of ICCs that 
an effective test item is one that discriminates be- 
tween high scorers and low scorers on the 
entire test. An ideal test item is one that most of 
the high scorers pass and most of the low scorers 
fail (see curve a in Figure 4.8). Simple visual in- 
spection of the ICC provides a coarse basis for 
gauging the discriminability of a test item: If the 
slope of the curve is positive and the curve is 
preferably ogive-shaped, the item is doing a good 
job of separating high and low scorers. But visual 
inspection is not a completely objective proce- 
dure; what is needed is a statistical tool that sum- 
marizes the discrimination power of individual test 
items. 

An item-discrimination index is a statistical
index of how efficiently an item discriminates be- 
tween persons who obtain high and low scores on 
the entire test. There are many indices of item dis- 
crimination, including such indirect measures as 
$r_{iT}$, the point-biserial correlation between scores
on an individual item and the total test score. How- 
ever, we will restrict our discussion here to a direct 
measure, the item-discrimination index, symbol- 
ized by the lowercase, italicized letter d. On an 
item-by-item basis, this index compares the per- 
formance of subjects in the upper and lower re- 
gions of total test score. The upper and lower 
ranges are generally defined as the upper- and 
lower-scoring 10 percent to 33 percent of the sam- 
ple. If the total test scores are normally distributed, 
the optimal comparison is the highest-scoring 27 
percent versus the lowest-scoring 27 percent of 
the examinees. If the distribution of total test 
scores is flatter than the normal curve, the optimal 
percentage is larger, approaching 33 percent. For 
most applications, any percentage between 25 and 
33 will yield similar estimates of d (Allen & Yen, 
1979). 

The item-discrimination index for a test item is 
calculated from the formula: 


d=(U-L)/N 


where U is the number of examinees in the upper 
range who answered the item correctly, L is the 
number of examinees in the lower range who answered
the item correctly, and N is the total number
of examinees in the upper or lower range.

Let us illustrate the computation and use of d 
with a hypothetical example. Suppose that a test 
developer has constructed the preliminary version 
of a multiple-choice achievement test and has ad- 
ministered the exam to a tryout sample of 400 high 
school students. After computing total scores for 
each subject, the test developer then identifies the 
high-scoring 25 percent and low-scoring 25 per- 
cent of the sample. Since there are 100 students in 
each group (25 percent of 400), N in the preceding 
formula will be 100. Next, for each item, the developer
determines the number of students in the
upper range and the lower range who answered it
correctly. To compute d for each item is a simple 
matter of plugging these values into the formula 
(U - L)/N. For example, suppose on the first item 
that 49 students in the upper range answered it cor- 
rectly, whereas 23 students in the lower range 
answered it correctly. For this item, d is equal to 
(49 - 23)/100, or .26.

It is evident from the formula for d that this 
index can vary from -1.0 to +1.0. Notice, too, that
a negative value for d is a warning signal that a test 
item needs revision or replacement. After all, such 
an outcome indicates that more of the low-scoring 
subjects answered the item correctly than did the 
high-scoring subjects. If d is zero, exactly equal 
numbers of low- and high-scoring subjects an- 
swered the item correctly; since the item is not 
discriminating between low- and high-scoring sub- 
jects at all, it should be revised or eliminated. A 
positive value for d is preferred, and the closer to 
+1.0 the better. Table 4.8 illustrates item-discrimi- 
nation indices for six items from the hypothetical 
test proposed here. 
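
The computations for the hypothetical test can be reproduced with a short sketch. The U and L counts are those listed in Table 4.8 with N = 100 per group; the helper that forms the upper and lower groups is an illustrative assumption about how total scores might be stored.

```python
def upper_lower_groups(total_scores, fraction=0.27):
    """Return (upper, lower) lists of examinee indices ranked by total score."""
    k = int(round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    return ranked[-k:], ranked[:k]


def discrimination_index(upper_correct, lower_correct, group_size):
    """d = (U - L) / N, where N is the size of one extreme group."""
    return (upper_correct - lower_correct) / group_size


# With 400 examinees and 25 percent groups, each group holds 100 students.
upper, lower = upper_lower_groups(list(range(400)), fraction=0.25)

# (U, L) counts for the six hypothetical items in Table 4.8.
table_4_8 = {1: (49, 23), 2: (79, 19), 3: (52, 52), 4: (100, 0), 5: (20, 80), 6: (0, 100)}
d_values = {item: discrimination_index(u, l, 100) for item, (u, l) in table_4_8.items()}
needs_revision = [item for item, d in d_values.items() if d <= 0]
```

Items 3, 5, and 6 have zero or negative values of d and would therefore be candidates for revision or elimination.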

A test developer can supplement the item-dis- 
crimination approach by inspecting the number of 
examinees in the upper- and lower-scoring groups 
who choose each of the incorrect alternatives. If a 
multiple-choice item is well written, the incorrect 
alternatives should be equally attractive to subjects 
who do not know the correct answer. Of course, we 
expect that high-scoring examinees will choose the 
correct alternative more often than low-scoring examinees,
which is the purpose of computing item-discrimination
indices. But, in addition, a good item
should show proportional dispersion of incorrect
choices for both high- and low-scoring subjects.


TABLE 4.8 Item-Discrimination Indices for Six Hypothetical Items

Item     U      L    (U - L)/N   Comment
1       49     23       .26      Very good item with high difficulty
2       79     19       .60      Excellent item but rarely achieved
3       52     52       .00      Poor item that should be revised
4      100      0      1.00      Ideal item but never achieved
5       20     80      -.60      Terrible item that should be eliminated
6        0    100     -1.00      Theoretically worst possible item


Assume that we investigate the choices of 100
high-scoring and 100 low-scoring subjects on a hypothetical
multiple-choice test. Correct choices are
indicated by an asterisk (*). Item 1 demonstrates
the desired pattern of answers, with incorrect
choices about equally dispersed:


                  Alternatives
Item 1            a    b    c*   d    e
High Scorers BGF“ BUN EINE
Low Scorers       15   14   40   16   15


On item 2, we notice that no examinees picked alternative
d. This alternative should be replaced
with a more appealing distractor:


Item 2 BO) BF COA Ki
High Scorers > Bee PL
Low Scorers       21   34   20    0   25


Item 3 is probably a poor item in spite of the fact
that it discriminates effectively between high- and
low-scoring subjects. The obvious problem is that
high-scoring examinees prefer alternative a to the
correct alternative, d:


Item 3            a    b    c    d*   e
High Scorers      43    6    5   37    9
Low Scorers       20   19   22   10   29


Perhaps by rewriting alternative a, this item could
be rescued. In any case, the main point here is that
test developers should pry into every corner of
every test item by every means possible, including
visual inspection of the pattern of answers.
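
The inspection of incorrect alternatives can also be partly automated. The sketch below uses the Item 3 counts as read from the example above (high scorers 43, 6, 5, 37, 9; low scorers 20, 19, 22, 10, 29; keyed answer d). The function name and the two warning rules are assumptions chosen to mirror the problems described for Items 2 and 3.

```python
def review_item(high, low, key):
    """Summarize an item from option counts in equal-sized high and low groups."""
    group_size = sum(high.values())
    d = (high[key] - low[key]) / group_size
    distractors = [option for option in high if option != key]
    # A distractor nobody chooses is not doing any work (the Item 2 problem).
    unused = [option for option in distractors if high[option] + low[option] == 0]
    # A distractor that high scorers prefer to the keyed answer (the Item 3 problem).
    outdraws_key = [option for option in distractors if high[option] > high[key]]
    return {"d": d, "unused": unused, "outdraws_key": outdraws_key}


item_3_high = {"a": 43, "b": 6, "c": 5, "d": 37, "e": 9}
item_3_low = {"a": 20, "b": 19, "c": 22, "d": 10, "e": 29}
report = review_item(item_3_high, item_3_low, key="d")
```

For Item 3 the discrimination index is respectable, yet the report still singles out alternative a, which is exactly the kind of problem that a summary index alone would miss.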





Reprise: The Best Items 


From all the methods of item analysis previously 
portrayed, which ones should the test developer use 
to identify the best items for a test? The answer to 
this question is neither simple nor straightforward. 
After all, the choice of “best” items depends upon 
the objectives of the test developer. For example, a 
theoretically inclined research psychologist might 
desire a measurement instrument with the highest 
possible internal consistency; item-reliability in- 
dices are crucial to this goal. A practically minded 
college administrator might wish for an instrument 
with the highest possible criterion validity; item- 
validity indices would be useful for this purpose. A 
remediation-oriented mental retardation specialist 
might desire an intelligence test with minimal floor 
effect; item-difficulty indices would be helpful in 
this regard. In sum, there is no single preferred 
method for item selection ideally suited to every 
context of assessment and test development. 


REVISING THE TEST


The purpose of item analysis, discussed previ- 
ously, is to identify unproductive items in the 
preliminary test so that they can be revised, elimi- 
nated, or replaced. Very few tests emerge from this 
process unscathed. It is common in the evolution- 
ary process of test development that many items 
are dropped, others refined, and new items added. 
The initial repercussion is that a new and slightly 
different test emerges. This revised test likely con- 
tains more discriminating items with higher relia- 
bility and greater predictive accuracy—but these 
improvements are known to be true only for the 
first tryout sample. 

The next step in test development is to collect 
new data from a second tryout sample. Of course, 
these examinees should be similar to those for 
whom the test is ultimately intended. The purpose 
of collecting additional test data is to repeat the 
item analysis procedures anew. If further changes 
are of the minor fine-tuning variety, the test devel- 
oper may decide the test is satisfactory and ready 
for cross-validational study, discussed in the fol- 
lowing section. If major changes are needed, it is 
desirable to collect data from a third and even perhaps
a fourth tryout sample. But at some point,
psychometric tinkering must end; the developer 
must propose a finalized instrument and proceed to 
the next step, cross validation. 


Cross Validation 


When a tryout sample is used to ascertain that a 
test possesses criterion-related validity, the evi- 
dence is quite preliminary and tentative. It is pru- 
dent practice in test development to seek fresh and 
independent confirmation of test validity before 
proceeding to publication. The term cross valida- 
tion refers to the practice of using the original re- 
gression equation in a new sample to determine 
whether the test predicts the criterion as well as it 
did in the original sample. Ghiselli, Campbell, and 
Zedeck (1981) outline the rationale for cross vali- 
dation: 


Whether items are chosen on the basis of empirical 
keying or whether they are corrected or weighted, 
the obtained results should, unless additional data 
are collected, be viewed as specific to the sample 
used for the statistical analyses. This is necessary 
because the obtained results have likely capitalized 
on chance factors operating in that group and there- 
fore are applicable only to the sample studied. 


Validity Shrinkage 


A common discovery in cross-validation research is 
that a test predicts the relevant criterion less ac- 
curately with the new sample of examinees than 
with the original tryout sample. The term validity 
shrinkage is applied to this phenomenon. For ex- 
ample, a biographically based predictor of sales po- 
tential might perform quite well for the sample of 
subjects used to develop the instrument, but demon- 
strate less validity when applied to a new group of 
examinees. Mitchell and Klimoski (1986) studied 
validity shrinkage of an instrument designed to 
foretell which students will succeed in real estate, as 
measured by the real-world criterion of obtaining a 
real estate license two years later. In one analysis 
based on the sample used to derive the test, the 
biographically based predictor test correlated .6 
with the criterion. But when this same test was tried
out on a new sample of real estate students, the cor- 
relation with the criterion was lower, about .4, 
demonstrating typical validity shrinkage. 

Validity shrinkage is an inevitable part of test 
development and underscores the need for cross
validation. In most cases, shrinkage is slight and 
the instrument withstands the challenge of cross 
validation. However, shrinkage of test validity can 
be a major problem when derivation and cross- 
validation samples are small, the number of poten- 
tial test items is large, and items are chosen on a 
purely empirical basis without theoretical rationale. 

A classic paper by Cureton (1950) demonstrates 
a worst-case scenario: using a very small sample to 
select empirically keyed items from a large item 
pool, then validating the test on the same sample. 
The criterion in his study was grade point average, 
artificially dichotomized into grades of B or better 
and grades below B. His “test” items consisted of 85 
tags, numbered on one side. For each of 29 students, 
the tags were shaken in a container and dropped on 
the table. All tags that fell with numbers up were 
recorded as indicating the presence of that “item” 
for the student. Next, Cureton conducted an item
analysis, using the dichotomized grades as the cri- 
terion. Based on this analysis, 24 items were found 
to be maximally predictive of students’ grades. Nine 
items occurred more often among students with the 
higher grades, and these items were weighted +1. 
Fifteen items occurred more often among students 
with the lower grades, and these items were 
weighted -1. The score on this test (facetiously
named the “B-Projective Psychokinesis Test”) con- 
sisted of the sum of these 24 item weights. 

In spite of the nonsensical nature of his test, 
Cureton (1950) found that test scores correlated 
.82 with grades. Of course, the strength of this cor- 
relation was due entirely to capitalization upon 
chance. If we were to conduct a series of cross- 
validation studies using new samples of students, 
the correlation between the B-Projective Psychoki- 
nesis Test and grades would likely hover right 
around zero, because this test is completely devoid 
of predictive validity. There is an important lesson 
here that applies to serious tests as well: Demonstrate
validity through cross validation; do not assume
it based merely on the solemn intentions of a
new instrument.
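
Capitalization upon chance is easy to demonstrate by simulation. The sketch below is loosely modeled on the Cureton demonstration and is not a reconstruction of his exact procedure: the item-selection rule (keep the 24 tags with the largest difference between grade groups), the size of the fresh sample, and all names are assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_shrinkage(seed=0, n_students=29, n_tags=85, n_keep=24, n_new=2000):
    rng = random.Random(seed)
    # Derivation sample: dichotomized grades and purely random tag outcomes.
    grades = [1 if i < n_students // 2 else 0 for i in range(n_students)]
    tags = [[rng.randint(0, 1) for _ in range(n_tags)] for _ in range(n_students)]
    n_high = sum(grades)
    n_low = n_students - n_high

    # "Item analysis": how much more often each tag fell face up in one group.
    diffs = []
    for j in range(n_tags):
        high_rate = sum(row[j] for row, g in zip(tags, grades) if g == 1) / n_high
        low_rate = sum(row[j] for row, g in zip(tags, grades) if g == 0) / n_low
        diffs.append(high_rate - low_rate)

    # Keep the most "predictive" tags, weighted +1 or -1 by direction.
    keep = sorted(range(n_tags), key=lambda j: abs(diffs[j]), reverse=True)[:n_keep]
    weights = {j: (1 if diffs[j] > 0 else -1) for j in keep}

    def score(row):
        return sum(w * row[j] for j, w in weights.items())

    # Validate on the same sample, then on a fresh and larger random sample.
    r_derivation = pearson_r([score(row) for row in tags], grades)
    new_grades = [i % 2 for i in range(n_new)]
    new_tags = [[rng.randint(0, 1) for _ in range(n_tags)] for _ in range(n_new)]
    r_cross = pearson_r([score(row) for row in new_tags], new_grades)
    return r_derivation, r_cross


r_derivation, r_cross = simulate_shrinkage()
```

Although every "item" here is pure noise, the correlation in the derivation sample is typically substantial, whereas the correlation in the fresh sample stays close to zero. A large fresh sample is used only so that the cross-validation estimate is stable.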


Feedback from Examinees 


In test revision, feedback from examinees is a po- 
tentially valuable source of information that is nor- 
mally overlooked by test developers. We can 
illustrate this approach with research by Nevo 
(1992). He developed the Examinee Feedback 
Questionnaire (EFeQ) to study the Inter-University 
Psychometric Entrance Examination, a major re- 
quirement for admission to the six universities in 
Israel. The Inter-University entrance exam is a 
group test consisting of five multiple-choice sub- 
tests: General Knowledge, Figural Reasoning, 
Comprehension, Mathematical Reasoning, and 
English. The EFeQ was designed as an anonymous 
posttest administered immediately after the Inter- 
University entrance exam. 

The EFeQ is a short and simple questionnaire 
designed to elicit candid opinions from examinees 
as to these features of the test-examiner-respon- 
dent matrix: 


• Behavior of examiners
• Testing conditions
• Clarity of exam instructions
• Convenience in using the answer sheet
• Perceived suitability of the test
• Perceived cultural fairness of the test
• Perceived sufficiency of time
• Perceived difficulty of the test
• Emotional response to the test
• Level of guessing
• Cheating by the examinee or others


The final question on the EFeQ is an open-ended 
essay: “We are interested in any remarks or sug- 
gestions you might have for improving the exam.” 
Some examples of feedback questions in the EFeQ 
tradition are provided in Figure 4.9. 

Nevo (1992) determined that the EFeQ ques- 
tionnaire possesses modest reliability, with a test- 
retest reliability of about .70. Regardless of the 
psychometric properties of his scale, the tradition of 
asking examinees for feedback about tests has 
proved invaluable. The Inter-University entrance 
exam was modified in numerous ways in response to
feedback: The answer sheet format was modified in
ways suggested by examinees; the time limit was increased
for specific tests reported to be too speeded;
certain items perceived as culturally biased or unfair
were deleted. In addition, security measures were
revised and tightened in order to minimize cheating,
which was much more prevalent than examiners had
anticipated. Nevo (1992) also cites a hidden advantage
to feedback questionnaires: They convey the
message that someone cares enough to listen, which
reduces postexamination stress. Examinee feedback
questionnaires should become a routine practice in
group standardized testing.


What is your opinion of the amount of time allotted for each test? Mark each box with a
number from 1 to 5 according to these ratings:

    5              4              3              2              1
    Way too        Too much       Adequate       Too little     Way too
    much time      time           time           time           little time

    [ ] General Knowledge
    [ ] Figural Reasoning
    [ ] Comprehension
    [ ] Mathematical Reasoning
    [ ] English

Did you or others cheat on this exam? Please check the boxes that apply. You can check
more than one box.

    [ ] Yes, I obtained a copy of the test.
    [ ] Yes, one of the testers illegally helped me.
    [ ] Yes, one of the examinees helped me during the test.
    [ ] Yes, I helped one of the other examinees.
    [ ] Yes, I used hidden notes during the test.
    [ ] Yes, I saw another person cheating.
    [ ] No, I did not cheat in any way.
    [ ] No, I did not see anyone else cheating.

FIGURE 4.9 Examples of Examinee Feedback Questionnaire Items
Source: Based upon Nevo, B. (1992). Examinee feedback: Practical guidelines. In M. Zeidner & R. Most (Eds.), Psychological testing: An inside view. Palo Alto, CA: Consulting Psychologists Press.


PUBLISHING THE TEST


The test construction process does not end with the
collection of cross-validation data. The test developer
also must oversee the production of the testing
materials, publish a technical manual, and
produce a user’s manual. A number of relevant
guidelines can be offered for each of these final
steps, as outlined in the following sections. Finally,
we close this chapter with a provocative comment
on the conservatism of modern test publishers.


Production of Testing Materials 


Testing materials must be user-friendly if they are 
to receive wide acceptance by psychologists and 
educators. Thus, a first guideline for test produc- 
tion is that the physical packaging of test materials 
must allow for quick and smooth administration. 
Consider the challenge posed by some performance
tests, in which the examiner must wrestle
with pencil, clipboard, test form, stopwatch, test 
manual, item shield, item box, and a disassembled 
cardboard object, all the while maintaining conver- 
sation with the examinee. If it is possible for the 
test developer to simplify the duties of the examiner 
while leaving examinee task demands unchanged, 
the resulting instrument will have much greater
acceptability to potential users. For example, if the
administration instructions can be summarized on 
the test form, the examiner can put the test manual 
aside while setting out the task for the examinee.
Another welcome addition to psychological test 
packaging is the stand-up ring binder that shows 
the test question on the side facing the examinee 
and provides instructions for administration on the 
reverse side facing the examiner.


Technical Manual and User's Manual 


Technical data about a new instrument are usually 
summarized with appropriate references in a tech- 
nical manual. Here, the prospective user can find 
information about item analyses, scale reliabilities, 
cross-validation studies, and the like. In some cases, 
this information is incorporated in the user’s man- 
ual, which gives instructions for administration and 
also provides guidelines for test interpretation. 
Test manuals should communicate information 
to many different groups ranging in background and 
training from measurement specialist to classroom 
teacher. Test manuals serve many purposes, as out- 
lined in the Standards for Educational and Psycho- 
logical Testing (AERA, APA, & NCME, 1985, 
1999). The influential Standards manual suggests 
that test manuals accomplish the following goals: 


• Describe the rationale and recommended uses
  for the test

• Provide specific cautions against anticipated
  misuses of a test

• Cite representative studies regarding general and
  specific test uses

• Identify special qualifications needed to administer
  and interpret the test

• Provide revisions, amendments, and supplements
  as needed

• Use promotional material that is accurate and
  research-based

• Cite quantitative relationships between test
  scores and criteria

• Report on the degree to which alternative modes
  of response (e.g., booklet versus an answer
  sheet) are interchangeable

• Provide appropriate interpretive aids to the test taker

• Furnish evidence of the validity of any automated
  test interpretations


Finally, test manuals should provide the essential 
data on reliability and validity rather than referring 
the user to other sources—an unfortunate practice 
encountered in some test manuals. 


Testing Is Big Business 


By now the reader should appreciate the intimidat- 
ing task faced by anyone who sets out to develop 
and publish a new test. Aside from the gargantuan 
proportions of the endeavor, test development is ex- 
traordinarily expensive, which means that publish- 
ers are inherently conservative about introducing 
new tests. Jensen (1980) provides the following 
provocative view on this topic: 


To produce a new general intelligence test that would 
be a really significant improvement over existing 
instruments would be a multimillion-dollar project 
requiring a large staff of test construction experts 
working for several years. Today we possess the 
necessary psychometric technology for producing 
considerably better tests than are now in popular use. 
The principal hindrances are copyright laws, vested 
interests of test publishers in the established tests in 
which they have already made enormous invest- 
ments, and the market economy for tests. Significant 
improvement of tests is not an attractive commercial 
venture initially and would probably have to depend 
on large-scale and long-term subsidies from govern- 
ment agencies and private foundations. 


SUMMARY


1. Test construction consists of six inter- 
twined stages: defining the test, selecting a scaling 
method, constructing the items, testing the items, 
revising the test, and publishing the test. 


2. Test developers need to select a scaling
method that is optimally suited to the manner in
which they have conceptualized the trait(s) measured
by the test. The notion of levels of measurement
is highly relevant in this context.


3. Four levels of measurement are recognized:
Nominal scales constitute mere naming or
categorizing; ordinal scales allow for ranking;
interval scales possess equal intervals; ratio scales
incorporate all the previous characteristics and also
introduce an absolute zero point.


4. Dozens of scaling methods exist. Repre- 
sentative examples include the method of absolute 
scaling, in which item difficulty is located on an 
axis or baseline and measured in standard devia- 
tion units of an anchor group; Likert scales, which 
present items with five responses ordered on an 
agree/disagree continuum; and the rational scaling 
approach, in which rationally derived items are 
correlated with total test scores. 


5. Constructing test items is a laborious and 
time-consuming procedure. Test developers should 
seek to avoid ceiling and floor effects. In a ceiling 
effect, significant numbers of examinees obtain 
perfect or near-perfect scores. In a floor effect, sig- 
nificant numbers of examinees obtain scores at or 
near the bottom of the scale. 


6. A table of specifications enumerates the 
information and cognitive tasks on which exami- 
nees are to be assessed. With achievement and abil- 
ity tests, item writers usually work from a table of 
specifications to ensure that the emerging instru- 
ment taps the desired mixture of cognitive pro- 
cesses and item contents. 


7. Test items can be written in many different
formats, including multiple choice, open-ended re- 
sponse, true-false, and forced-choice. Matching 
questions, popular in classroom testing, are psy- 
chometrically questionable since the choices are 
not independent of one another. 


8. The purpose of item analysis is to deter- 
mine which initial items should be retained, which 
revised, and which thrown out. Many statistical 
procedures are available for item analysis, includ- 
ing item-difficulty index, item-reliability index, 
item-validity index, item-characteristic curve, and 
item-discrimination index. 


9. The term cross validation refers to the 
practice of revalidating a test on a new sample of 
examinees. Validity shrinkage refers to the com- 
mon phenomenon wherein a test predicts the rele- 
vant criterion less accurately with a new sample 
than with the original tryout sample. 


10. Tests must be user-friendly if they are to re- 
ceive wide acceptance by psychologists and educa- 
tors. For example, stand-up ring binders that show 
instructions on one side and display the test stimuli 
on the other side are especially desirable. Test users 
also welcome a thorough technical manual that 
summarizes technical data and validation research. 
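
As a rough illustration of two ideas from this summary, the sketch below computes an item-difficulty index (the proportion of examinees answering an item correctly) and checks total scores for signs of a ceiling or floor effect. The function names, the sample data, and the 10 percent cutoff are assumptions chosen for the example, not values prescribed in the text.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)


def extreme_score_proportions(total_scores, max_score, near=0.1):
    """Return (share of scores near the top, share near the bottom) of the range.

    `near` is an arbitrary illustrative cutoff: scores within 10% of the
    maximum count as near-perfect, scores within 10% of zero as near-bottom.
    """
    top_cut = max_score * (1 - near)
    bottom_cut = max_score * near
    n = len(total_scores)
    near_top = sum(score >= top_cut for score in total_scores) / n
    near_bottom = sum(score <= bottom_cut for score in total_scores) / n
    return near_top, near_bottom


# Hypothetical data: one item answered by 8 examinees, and 8 total scores on a 20-point test.
item = [1, 1, 0, 1, 1, 1, 0, 1]
totals = [20, 19, 17, 20, 12, 19, 20, 7]

difficulty = item_difficulty(item)
near_top, near_bottom = extreme_score_proportions(totals, max_score=20)
print(difficulty, near_top, near_bottom)
```

In this invented sample most examinees score close to the maximum, which is the pattern a test developer would read as a possible ceiling effect.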


KEY TERMS AND CONCEPTS 


nominal scale p. 120 
ordinal scale p. 120 
interval scale p. 120 

ratio scale p. 120 
rankings of experts p. 121 
method of equal-appearing intervals p. 122 
method of absolute scaling p. 123 
Likert scale p. 123 

Guttman scale p. 124 

method of empirical keying p. 125 
method of rational scaling p. 125 
table of specifications p. 126 


forced-choice methodology p. 128 
item-difficulty index p. 128 
item-reliability index p. 129 
item-validity index p. 130 
item-characteristic curve p. 130 
normal ogive p. 130 
item-discrimination index p. 131 
cross validation p. 133 

validity shrinkage p. 133 
technical manual p. 136 

user’s manual p. 136 


Intelligence Testing I:
Theories and Preschool 
Assessment 


Topic 5A 


Definitions of Intelligence 


Theories and the Measurement of Intelligence 


Case Exhibit 5.1 Learning and Adaptation as Core Functions of Intelligence 


Theories of Intelligence 
Summary 
Key Terms and Concepts 


This chapter opens an extended discussion of
intelligence testing, a topic so important and 
immense that we devote the next two chapters to it 
as well. In order to understand contemporary intel- 
ligence testing, the reader will need to assimilate 
certain definitions, theories, and mainstream as- 
sessment practices. The goal of Topic 5A, Theories
and the Measurement of Intelligence, is to investi- 
gate the various meanings given to the term intelli- 
gence and to discuss how definitions and theories 
have influenced the structure and content of intel- 
ligence tests. An important justification for this 
topic is that an understanding of theories of intelli- 
gence is crucial for establishing the construct va- 
lidity of IQ measures. In Topic 5B, Assessment of
Infant and Preschool Abilities, we review the na-
ture and application of prominent infant assessment 
devices and then investigate a fundamental issue: 
What is the practical utility of these instruments? 
We begin with a review of early, traditional, and 
contemporary theories of intelligence. 
Intelligence is one of the most highly re- 
searched topics in psychology. Thousands of re- 
search articles are published each year on the nature 
and measurement of intelligence. New journals 
such as Intelligence and The Journal of Psychoed- 
ucational Assessment have flourished in response 
to the scholarly interest in this topic. Despite this 
burgeoning research literature, the definition of in- 
telligence remains elusive, wrapped in controversy 
and mystery. In fact, the discussion that follows will
illustrate a major paradox of modern testing: Psy- 
chometricians are better at measuring intelligence 
than conceptualizing it! 

Even though defining intelligence has proved 
to be a frustrating endeavor, there is much to be 
gained by reviewing historical and contemporary
efforts to clarify its meaning. After all, intelligence 
tests did not materialize out of thin air. Most tests 
are grounded in a specific theory of intelligence and 
most test developers offer a definition of the con- 
struct as a starting point for their endeavors. For 
these reasons, we can better understand and evalu- 
ate the multifaceted character of contemporary 
tests if we first review prominent definitions and 
theories of intelligence. 


DEFINITIONS OF INTELLIGENCE


Before we discuss definitions of intelligence, we 
need to clarify the nature of definition itself. Stern- 
berg (1986) makes a distinction between opera- 
tional and “real” definitions that is important in this 
context. An operational definition defines a con- 
cept in terms of the way it is measured. Boring 
(1923) carried this viewpoint to its extreme when 
he defined intelligence as “what the tests test.” Be- 
lieve it or not, this was a serious proposal, designed 
largely to short-circuit rampant and divisive dis- 
agreements about the definition of intelligence. 

Operational definitions of intelligence suffer 
from two dangerous shortcomings (Sternberg, 
1986). First, they are circular. Intelligence tests 
were invented to measure intelligence, not to define 
it. The test designers never intended for their 
instruments to define intelligence. Second, opera- 
tional definitions block further progress in under- 
standing the nature of intelligence, because they 
foreclose discussion on the adequacy of theories of 
intelligence. 

This second problem—the potentially stultify- 
ing effects of relying upon operational definitions 
of intelligence—casts doubt upon the common 
practice of affirming the concurrent validity of new 
tests by correlating them with old tests. If estab- 
lished tests serve as the principal criterion against 


which new tests are assessed, then the new tests 
will be viewed as valid only to the extent that they 
correlate with the old ones. Such a conservative 
practice drastically curtails innovation. The opera- 
tional definition of intelligence does not allow for 
the possibility that new tests or conceptions of in- 
telligence may be superior to the existing ones. 
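
To make the practice under discussion concrete, the following sketch computes the kind of coefficient typically reported as concurrent validity: the Pearson correlation between scores on a new test and scores on an established test, obtained from the same examinees. The score lists and the function name are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six examinees on an established test and a new test.
old_test = [95, 100, 105, 110, 120, 130]
new_test = [98, 97, 108, 112, 118, 127]

r = pearson_r(old_test, new_test)
print(round(r, 2))
```

A high coefficient here shows only that the new test ranks examinees much as the old one does. As the passage argues, it cannot show that the new test is the better measure.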

We must conclude, then, that operational defi- 
nitions of intelligence leave much to be desired. In 
contrast, a real definition is one that seeks to tell 
us the true nature of the thing being defined (Robin- 
son, 1950; Sternberg, 1986). Perhaps the most 
common way—but by no means the only way—of 
producing real definitions of intelligence is to ask 
experts in the field to define it. 


Expert Definitions of Intelligence 


Intelligence has been given many real definitions 
by prominent researchers in the field. Following, 
we list several examples, paraphrased slightly for 
editorial consistency. The reader will note that 
many of these definitions appeared in an early but 
still influential symposium, “Intelligence and Its 
Measurement,” published in the Journal of Educa- 
tional Psychology (Thorndike, 1921). Other defin- 
itions stem from a modern update of this early 
symposium, What Is Intelligence? edited by Stern- 
berg and Detterman (1986). Intelligence has been 
defined as the following: 


Spearman (1904, 1923): a general ability that 
involves mainly the eduction of relations and 
correlates. 

Binet and Simon (1905): the ability to judge 
well, to understand well, to reason well. 

Terman (1916): the capacity to form concepts 
and to grasp their significance. 

Pintner (1921): the ability of the individual to 
adapt adequately to relatively new situations 
in life. 

Thorndike (1921): the power of good responses 
from the point of view of truth or fact. 

Thurstone (1921): the capacity to inhibit in- 
stinctive adjustments, flexibly imagine differ- 
ent responses, and realize modified instinctive 
adjustments into overt behavior. 

Wechsler (1939): the aggregate or global ca-
pacity of the individual to act purposefully, 
to think rationally, and to deal effectively 
with the environment. 

Humphreys (1971): the entire repertoire of ac- 
quired skills, knowledge, learning sets, and 
generalization tendencies considered intel- 
lectual in nature that are available at any one 
period of time. 

Piaget (1972): a generic term to indicate the su- 
perior forms of organization or equilibrium 
of cognitive structuring used for adaptation 
to the physical and social environment. 

Sternberg (1985a, 1986): the mental capacity to 
automatize information processing and to 
emit contextually appropriate behavior in re- 
sponse to novelty; intelligence also includes 
metacomponents, performance components, 
and knowledge-acquisition components (dis- 
cussed later). 

Eysenck (1986): error-free transmission of in- 
formation through the cortex. 

Gardner (1986): the ability or skill to solve 
problems or to fashion products that are val- 
ued within one or more cultural settings. 

Ceci (1994): multiple innate abilities that serve 
as a range of possibilities; these abilities de- 
velop (or fail to develop, or develop and later 
atrophy) depending upon motivation and ex- 
posure to relevant educational experiences. 

Sattler (2001): intelligent behavior reflects the 
survival skills of the species, beyond those as- 
sociated with basic physiological processes. 


The preceding list of definitions is representa- 
tive although definitely not exhaustive. For one 
thing, the list is exclusively Western and omits sev- 
eral cross-cultural conceptions of intelligence.
Eastern conceptions of intelligence, for example, 
emphasize benevolence, humility, freedom from 
conventional standards of judgment, and doing 
what is right as essential to intelligence. Many 
African conceptions of intelligence place heavy 
emphasis upon social aspects of intelligence such 
as maintaining harmonious and stable intergroup 
relations (Sternberg & Kaufman, 1998). The reader 
can consult Bracken and Fagan (1990), Sternberg
(1994), and Sternberg and Detterman (1986) for 
additional ideas. Certainly, this sampling of views 
is sufficient to demonstrate that there appear to be 
as many definitions of intelligence as there are ex- 
perts willing to define it! 

In spite of this diversity of viewpoints, two 
themes recur again and again in expert definitions 
of intelligence. Broadly speaking, the experts tend 
to agree that intelligence is (1) the capacity to learn 
from experience, and (2) the capacity to adapt to 
one’s environment. That learning and adaptation 
are both crucial to intelligence stands out with 
poignancy in certain cases of mental disability in 
which persons fail to possess one or the other ca- 
pacity in sufficient degree (Case Exhibit 5.1). 

How well do intelligence tests capture the 
experts’ view that intelligence consists of learn- 
ing from experience and adaptation to the environ- 
ment? The reader should keep this question in 
mind as we proceed to review major intelligence 
tests in the topics that follow. Certainly, there is 
cause for concern: Very few contemporary intelli- 
gence tests appear to require the examinee to learn 
something new or to adapt to a new situation as part 
and parcel of the examination process. At best, 
prominent modern tests provide indirect measures 
of the capacities to learn and adapt. How well they 
capture these dimensions is an empirical question 
that must be demonstrated through validational 
research. 


Layperson and Expert 
Conceptions of Intelligence 


Another approach to understanding a construct is to 
study its popular meaning. This method is more sci- 
entific than it may appear. Words have a common 
meaning to the extent that they help provide an ef- 
fective portrayal of everyday transactions. If 
laypersons can agree on its meaning, a construct 
such as intelligence is in some sense “real” and 
therefore potentially useful. Thus, asking persons 
on the street, “What does intelligence mean to 
you?” has much to recommend it. 

Sternberg, Conway, Ketron, and Bernstein
(1981) conducted a series of studies to investigate 
conceptions of intelligence held by American 
adults. In the first study, people in a train station, 
entering a supermarket, and studying in a college li- 
brary were asked to list behaviors characteristic of 
different kinds of intelligence. In a second study— 
the only one discussed here—both laypersons and 
experts (mainly academic psychologists) rated the 
importance of these behaviors to their concept of 
an “ideally intelligent” person. 

The behaviors central to expert and lay con- 
ceptions of intelligence turned out to be very simi- 
lar, although not identical. In order of importance, 
experts saw verbal intelligence, problem-solving 
ability, and practical intelligence as crucial to 
intelligence. Laypersons regarded practical prob- 
lem-solving ability, verbal ability, and social com- 
petence to be the key ingredients in intelligence. Of 
course, opinions were not unanimous; these
conceptions represent the consensus view of each
group. The components of intelligence and repre- 
sentative descriptors are shown in Table 5.1. 

In their conception of intelligence, experts 
placed more emphasis upon verbal ability than 
problem solving, whereas laypersons reversed these
priorities. Nonetheless, experts and laypersons 
alike consider verbal ability and problem solving to 
be essential aspects of intelligence. As the reader 
will see, most intelligence tests also accent these 
two competencies. Prototypical examples would be 
vocabulary (verbal ability) and block design (prob- 
lem solving) from the Wechsler scales, discussed 
later. We see then that everyday conceptions of in- 
telligence are, in part, mirrored quite faithfully by 
the content of modern intelligence tests. 

TABLE 5.1 Factors and Sample Items Underlying Conceptions of Intelligence
for Laypersons and Experts

Laypersons                                  Experts

Practical Problem-Solving Ability           Verbal Intelligence
  Reasons logically and well                  Displays a good vocabulary
  Identifies connections among ideas          Reads with high comprehension
  Sees all aspects of a problem               Displays curiosity
  Keeps an open mind                          Is intellectually curious

Verbal Ability                              Problem-Solving Ability
  Speaks clearly and articulately             Able to apply knowledge to problems at hand
  Is verbally fluent                          Makes good decisions
  Converses well                              Poses problems in an optimal way
  Is knowledgeable about a particular         Displays common sense
    field of knowledge

Social Competence                           Practical Intelligence
  Accepts others for what they are            Sizes up situations well
  Admits mistakes                             Determines how to achieve goals
  Displays interest in the world at large     Displays awareness to world
  Is on time for appointments                 Displays interest in the world at large

Note: For each factor, only the four items with the highest loading are listed here. Factor names were
provided by the researchers.

Source: Reprinted with permission from Sternberg, R. J., Conway, B. E., Ketron, J. L., & Bernstein, M.
(1981). People’s conceptions of intelligence. Journal of Personality and Social Psychology, 41, 37-55.


Some disagreement between experts and laypersons
is also evident. Experts consider practical
intelligence (sizing up situations, determining how
to achieve goals, awareness and interest in the
world) an essential constituent of intelligence,
whereas laypersons identify social competence (ac- 
cepting others for what they are, admitting mis- 
takes, punctuality, and interest in the world) as a 
third component. Yet, these two nominations do 
share one property in common: Contemporary tests 
generally make no attempt to measure either 
practical intelligence or social competence. Partly, 
this reflects the psychometric difficulties encoun- 
tered in devising test items relevant to these con- 
tent areas. However, the more influential reason 
intelligence tests do not measure practical intelli- 
gence or social competence is inertia: Test devel- 
opers have blindly accepted historically incomplete 
conceptions of intelligence. Until recently, the de- 
velopment of intelligence testing has been a con- 
servative affair, little changed since the days of 
Binet and the Army Alpha and Beta tests for World 
War I recruits. There are some signs that testing 


practices may soon evolve, however, with the de- 
velopment of innovative instruments. For example, 
Sternberg and colleagues have proposed innovative 
tests based upon Sternberg’s model of intelligence. Another
interesting instrument based upon a new model of 
intelligence is the Everyday Problem Solving In- 
ventory (Cornelius & Caspi, 1987). In this test, ex- 
aminees must indicate their typical response to 
everyday problems such as failing to bring money, 
checkbook, or credit card when taking a friend to 
lunch. 

We turn now to a review of major theories of 
intelligence. A reminder: The justification for re- 
viewing theories is to illustrate how they have in- 
fluenced the structure and content of intelligence 
tests. In addition, the construct validity of IQ tests 
depends upon the extent to which they embody spe- 
cific theories of intelligence, so a review of theories 
is pertinent to test validation as well. 


THEORIES OF INTELLIGENCE


Galton and Sensory Keenness


The first theories of intelligence were derived in the 
Brass Instruments era of psychology at the turn of 
the last century. The reader will recall from 
Topic 1A that Sir Francis Galton and his disciple J. 
McKeen Cattell thought that intelligence was un- 
derwritten by keen sensory abilities. This incom- 
plete and misleading assumption was based on a 
plausible premise: 


The only information that reaches us concerning 
outward events appears to pass through the avenues 
of our senses; and the more perceptive the senses 
are of difference, the larger is the field upon which 
our judgment and intelligence can act. (Galton, 
1883) 


The sensory keenness theory of intelligence 
promoted by Galton and Cattell proved to be largely 
a psychometric dead end. However, we do see ves- 
tiges of this approach in modern chronometric 
analyses of intelligence such as the Reaction 
Time—Movement Time (RT-MT) apparatus, an ex- 
perimental method favored by Jensen (1980) for the 
culture-reduced study of intelligence (Figure 5.1). 





FIGURE 5.1 The Reaction Time-Movement Time Apparatus
Note: The open circles are push buttons; the crossed circles are
green signal lights.
Source: Reprinted with permission from Jensen, A. R. (1980).
Bias in mental testing. New York: Free Press. Copyright © 1980
by Arthur R. Jensen. Reprinted with permission of The Free Press,
a Division of Simon & Schuster.


In RT-MT studies, the subject is instructed to place 
the index finger of the preferred hand on the home 
button; then an auditory warning signal is sounded, 
followed (in 1 to 4 seconds) by one of the eight 
green lights going on, which the subject must 
turn off as quickly as possible by touching the 
microswitch button directly below it. RT is the 
time the subject takes to remove his or her finger 
from the home button after a green light goes on. 
MT is the interval between removing the finger 
from the home button and touching the button that 
turns off the green light. Jensen (1980) reported that 
indices of RT and MT correlated as high as .50 with 
traditional psychometric tests of intelligence.¹ P. A.
Vernon has also reported substantial relationships— 
as high as .70 for multiple correlations—between 
speed-of-processing RT-type measures and tradi- 
tional measures of intelligence (Vernon, 1994; Ver- 
non & Mori, 1990). These findings suggest that 
speed-of-processing measures such as RT might be 
a useful addition to standardized intelligence test 
batteries. In general, test developers have resisted 
the implications of this line of research. 
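
The two intervals defined above can be written out directly. The sketch below assumes each trial is logged as three timestamps in milliseconds (light onset, release of the home button, and press of the target button). The record layout and field names are hypothetical and are not part of the apparatus described by Jensen.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One RT-MT trial; timestamps in milliseconds (hypothetical record layout)."""
    light_on: float        # a green light goes on
    home_released: float   # finger leaves the home button
    target_pressed: float  # button below the light is touched


def reaction_time(trial):
    """RT: from light onset to release of the home button."""
    return trial.home_released - trial.light_on


def movement_time(trial):
    """MT: from release of the home button to touching the target button."""
    return trial.target_pressed - trial.home_released


trial = Trial(light_on=0.0, home_released=320.0, target_pressed=540.0)
print(reaction_time(trial), movement_time(trial))  # 320.0 220.0
```

Because lower values mean faster responding, scores of this kind would be expected to correlate negatively with intelligence test scores in raw form.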


Spearman and the g Factor 


Based on extensive study of the patterns of corre- 
lations between various tests of intellectual and 
sensory ability, Charles Spearman (1904, 1923, 
1927) proposed that intelligence consisted of two 
kinds of factors: a single general factor g and nu- 
merous specific factors s₁, s₂, s₃, and so on. As a
necessary adjunct to his theory, Spearman helped 
invent factor analysis to aid his investigation of the 
nature of intelligence. Spearman used this statisti- 
cal technique to discern the number of separate un- 
derlying factors that must exist to account for the 
observed correlations between a large number of 
tests. 

1. Actually, the raw correlation coefficient is negative because
faster reaction times (lower numerical scores) are associated
with higher intelligence scores.

In Spearman’s view, an examinee’s performance
on any homogeneous test or subtest of intellectual
ability was determined mainly by two
influences: g, the pervasive general factor, and s, a
factor specific to that test or subtest. (An error fac- 
tor e could also sway scores, but Spearman sought 
to minimize this influence by using highly reliable 
instruments.) Because the specific factor s was dif- 
ferent for each intellectual test or subtest and was 
usually less influential than g in determining per- 
formance level, Spearman expressed less interest in 
studying it. He concentrated mainly on defining the 
nature of g, which he likened to an “energy” or 
“power” that serves in common the whole cortex. 
In contrast, Spearman considered s, the specific 
factor, to have a physiological substrate localized in 
the group of neurons serving the particular kind of 
mental operation demanded by a test or subtest. 
Spearman (1923) wrote, “These neural groups 
would thus function as alternative ‘engines’ into 
which the common supply of ‘energy’ could be al- 
ternatively distributed.” 

Spearman reasoned that some tests were heav- 
ily loaded with the g factor, whereas other tests— 
especially purely sensory measures—were repre- 
sentative mainly of a specific factor. Two tests each 
heavily loaded with g should correlate quite 
strongly. In contrast, psychological tests not satu- 
rated with g should show minimal correlation with 
one another. Much of Spearman’s research was 
aimed at demonstrating the truth of these basic 
propositions derived from his theory. We have il- 
lustrated these points graphically in Figure 5.2. In 
this figure, each circle represents an intelligence 
test, and the degree of overlap between circles in- 
dicates the strength of correlation. Notice that tests
A and B, each heavily loaded on g, correlate quite
strongly. Tests C and D have weak loadings on g
and consequently do not correlate well.

FIGURE 5.2 Spearman’s Two-Factor Theory of Intelligence
Note: Tests A and B correlate strongly, whereas C and D correlate
weakly. See text.
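
One way to see why the pattern for tests A through D follows from the theory: under a model with a single general factor, in which the specific factors are uncorrelated with g and with one another, the expected correlation between two tests is the product of their g loadings. The sketch below applies that standard result to four hypothetical tests. The loading values are invented for illustration and do not come from Spearman's data.

```python
def implied_correlation(loading_a, loading_b):
    """Correlation between two tests implied by a one-general-factor model."""
    return loading_a * loading_b


# Hypothetical g loadings: A and B are heavily loaded, C and D weakly loaded.
g_loadings = {"A": 0.9, "B": 0.8, "C": 0.3, "D": 0.2}

r_ab = implied_correlation(g_loadings["A"], g_loadings["B"])
r_cd = implied_correlation(g_loadings["C"], g_loadings["D"])
print(round(r_ab, 2), round(r_cd, 2))  # 0.72 0.06
```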

Spearman (1923) believed that individual dif- 
ferences in g were most directly reflected in the 
ability to use three principles of cognition: appre- 
hension of experience, eduction of relations, and 
eduction of correlates. Incidentally, the little-
used term eduction refers to the process of figuring 
things out. These three principles can be explained 
by examining how we solve analogies of the form 
A:B::C:?; that is, A is to B as C is to ?. A simple
example might be HAMMER:NAIL::SCREW- 
DRIVER:? To solve this analogy, we must first per- 
ceive and understand each term based on past 
experience; that is, we must have apprehension of 
experience. If we have no idea what a hammer, nail, 
and screwdriver are, there is little chance we can 
complete the analogy correctly. Next, we must infer 
the relation between the first two analogy terms, in 
this case, HAMMER and NAIL. Using a somewhat 
stilted phrase, Spearman referred to the ability to 
infer the relation between two concepts as eduction 
of relations. The final step, eduction of correlates, 
refers to the ability to apply the inferred principle 
to the new domain, in this case, applying the rule 
inferred to produce the correct response, namely, 
SCREWDRIVER:SCREW. 

Although Spearman’s physiological specula- 
tions have been largely dismissed, the idea of a 
general factor has been a central topic in research 
on intelligence and is still very much alive today 
(Jensen, 1979). The correctness of the g factor 
viewpoint is more than an academic issue. If it is 
true that a single, pervasive general factor is the es- 
sential wellspring of intelligence, then psychomet- 
ric efforts to produce factorially pure subtests (e.g., 
measuring verbal comprehension, perceptual orga- 
nization, short-term memory, and so on) are largely 
misguided. To the extent that Spearman is correct, 
test developers should forgo subtest derivation and
concentrate on producing a test that best captures 
the general factor. 

The most difficult issue faced by Spearman’s 
two-factor theory is the existence of group factors. 
As early as 1906, Spearman and his contemporaries 
noted that relatively dissimilar tests could have cor-
relations higher than the values predicted from their 
respective g loadings (Brody & Brody, 1976). This 
finding raised the possibility that a group of diverse 
measures might share in common a unitary ability 
other than g. For example, several tests might share 
a common unitary memorization factor that was 
halfway between the g factor and the various s fac- 
tors unique to each test. Of course, the existence of 
group factors is incompatible with Spearman’s 
meticulous two-factor theory. 
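
The reasoning behind group factors can be sketched numerically. If g were the only source of shared variance, the observed correlation between two tests should be close to the product of their g loadings; a clearly positive remainder hints at some other ability held in common. The loadings, the observed correlation, and the function name below are hypothetical.

```python
def residual_correlation(observed_r, loading_a, loading_b):
    """Observed correlation minus the part predicted from the two g loadings."""
    return observed_r - loading_a * loading_b


# Hypothetical memory tests with modest g loadings but a sizable observed correlation.
residual = residual_correlation(observed_r=0.50, loading_a=0.5, loading_b=0.4)
print(round(residual, 2))  # 0.3
```

Here g alone would predict a correlation of about .20, so the leftover .30 is the kind of surplus that a shared memorization factor could explain.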


Thurstone and the Primary Mental Abilities 


Thurstone (1931) developed factor-analysis proce- 
dures capable of searching correlation matrices for 
the existence of group factors. His methods per- 
mitted a researcher to discover empirically the 
number of factors present in a matrix and to define 
each factor in terms of the tests that loaded on it. In 
his analysis of how scores on different kinds of 
intellectual tests correlated with each other, Thur- 
stone concluded that several broad group factors— 
and not a single general factor—could best explain 
empirical results. At various points in his research 
career, he proposed approximately a dozen differ- 
ent factors. Only seven of these factors have been 
frequently corroborated (Thurstone, 1938; Thur- 
stone & Thurstone, 1941) and they have been des- 
ignated primary mental abilities (PMAs). They 
are as follows: 


• Verbal Comprehension: The best measure is vocabulary,
  but this ability is also involved in reading
  comprehension and verbal analogies.

• Word Fluency: Measured by such tests as anagrams
  or quickly naming words in a given category
  (e.g., foods beginning with the letter S).

• Number: Virtually synonymous with the speed
  and accuracy of simple arithmetic computation.

• Space: Such as the ability to visualize how a
  three-dimensional object would appear if it was
  rotated or partially disassembled.

• Associative Memory: Skill at rote memory tasks
  such as learning to associate pairs of unrelated
  items.

• Perceptual Speed: Involved in simple clerical
  tasks such as checking for similarities and
  differences in visual details.

• Inductive Reasoning: The best measures of this
  factor involve finding a rule, as in a number
  series completion test.


Thurstone (1938) published the Primary Men- 
tal Abilities Test consisting of separate subtests, 
each designed to measure one PMA. However, he 
later acknowledged that his primary mental abilities
correlated moderately with each other, suggesting
the existence of one or more second-order factors.
Ultimately, Thurstone acknowledged the existence 
of g as a higher-order factor. By this time, Spear- 
man had admitted the existence of group factors 
representing special abilities, and it became appar- 
ent that the differences between Spearman and 
Thurstone were largely a matter of emphasis (Brody 
& Brody, 1976). Spearman continued to believe that 
g was the major determinant of correlations be- 
tween test scores and assigned a minor role to group 
factors. Thurstone reversed these priorities. 
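
The step from moderately correlated primary abilities to a higher-order factor can be illustrated numerically. The sketch below is only an illustration and is not Thurstone's procedure: it assumes a hypothetical correlation matrix in which all seven PMAs intercorrelate at .40 (a made-up value), and it uses the first principal component as a rough stand-in for a general factor. The function name `first_component` is likewise hypothetical.

```python
import math

def first_component(corr, iterations=200):
    """Largest eigenvalue and loadings of a correlation matrix (power iteration)."""
    n = len(corr)
    v = [1.0 / math.sqrt(n)] * n          # start from a unit-length vector
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(n))
                     for i in range(n))
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings

# Hypothetical data: seven primary abilities, every pair correlating .40.
N_ABILITIES = 7
R = 0.4
corr = [[1.0 if i == j else R for j in range(N_ABILITIES)]
        for i in range(N_ABILITIES)]

eigenvalue, loadings = first_component(corr)
share = eigenvalue / N_ABILITIES          # proportion of total variance

print(round(eigenvalue, 2), round(share, 2))
print([round(x, 2) for x in loadings])
```

Under these assumed values, a single component accounts for nearly half of the total variance and every ability loads on it positively, which is the kind of pattern that leads to positing a second-order factor. With weaker intercorrelations the share shrinks toward 1/7, the value expected when the abilities are unrelated.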

P. E. Vernon (1950) provided a rapprochement 
between these two viewpoints by proposing a hier- 
archical group factor theory. In his view, g was the 
single factor at the top of a hierarchy that included 
two major group factors labeled verbal-educational 
(V:ed) and practical-mechanical-spatial-physical 
(k:m). Underneath these two major group factors 
were several minor group factors resembling the 
PMAs of Thurstone; specific factors occupied the 
bottom of the hierarchy (Figure 5.3). 

FIGURE 5.3 Vernon’s Hierarchical Group Factor
Theory of Intelligence

Source: Reprinted with permission from Vernon, P. E. (1950).
The structure of human abilities. London: Methuen.


Thurstone’s analysis of PMAs continues to influence
test development even today. Schaie (1983,
1985) has revised and modified the Primary Men- 
tal Abilities Test and used these measures in an 
enormously influential longitudinal study of adult 
intelligence. If intelligence were mainly a matter of 
g, then the group factors should change at about the 
same rate with aging. In support of the group fac- 
tor approach to intellectual testing, Schaie (1983) 
reports that some PMAs show little age-related 
decrement (Verbal Comprehension, Word Fluency, 
Inductive Reasoning), whereas other PMAs decline 
more rapidly in old age (Space, Number). Thus, 
there may be practical real-world reasons for re- 
porting group factors and not condensing all of in- 
telligence into a single general factor. 


R. Cattell and the Fluid/Crystallized Distinction 


Raymond Cattell (1941, 1971) proposed an influ- 
ential theory of the structure of intelligence that has 
been revised and extended by John Horn (1968, 
1994). As did their predecessors, Cattell and Horn 
used factor analysis to study the structure of intel- 
ligence. But instead of finding a single general fac- 
tor or a half dozen group factors, Cattell and Horn 
identified two major factors, which they labeled 
fluid intelligence (gf) and crystallized intelligence
(gc). Fluid intelligence is a largely nonverbal and
relatively culture-reduced form of mental effi- 
ciency. It is related to a person’s inherent capacity 
to learn and solve problems. Thus, fluid intelli- 
gence is used when a task requires adaptation to a 
new situation. By contrast, crystallized intelligence 
represents what one has already learned through the 
investment of fluid intelligence in cultural settings 
(e.g., learning algebra in school). Crystallized in- 
telligence is highly culturally dependent and is 
used for tasks that require a learned or habitual re- 
sponse. Since crystallized intelligence arises when 
fluid intelligence is applied to cultural products, we 
would expect these two kinds of intelligence to be 
correlated. In fact, it is commonly found that mea- 
sures of crystallized and fluid intelligence correlate 
moderately (r = .5). 

The abilities that make up fluid intelligence are
nonverbal and not heavily dependent upon expo- 
sure to a specific culture. For these reasons, Cattell 
(1940) believed that measures of fluid intelligence 
were culture-free. Based on this assumption, he 
devised the Culture Fair Intelligence Test in an 
attempt to eliminate cultural bias in testing. Of 
course, calling a test culture-fair does not make it 
necessarily so. In fact, the goal of a completely 
culture-free intelligence test has proved elusive. We 
discuss the CFIT in more detail in Topic 6B, Group 
Tests of Intelligence. 

In later versions of the fluid/crystallized theory 
of intelligence, Cattell (1971) and Horn (1982, 
1994) expanded and elaborated on the previously 
discussed concepts. Today their approach might 
better be called a theory of many intelligences, but 
the gf-gc designation has become so well known
that it will not easily be phased out. In the latest re- 
visions, the authors have proposed a hierarchical, 
interlocking model of intelligence with fluid and 
crystallized components at the top. These capacities
are subserved by identified subcomponents of
intelligence, including visual organization, percep- 
tual speed, auditory organization, several memory 
capacities, and specific sensory reception compo- 
nents as well. The revised model is labyrinthine; in- 
terested readers should consult Horn (1994). 


Piaget and Adaptation 


The Swiss psychologist Jean Piaget (1896-1980) 
devised a theory of cognitive development that has 
a number of implications for the design of chil- 
dren’s intelligence tests (Ginsburg & Opper, 1988). 
Piaget (1926, 1952, 1972) used interviews and
informal tests with children to develop a series of
provocative and revolutionary views about intel- 
lectual development. His new perspective included 
the following points: 


• Children’s thought is qualitatively different from
  adults’ thought.

• Psychological structures called schemas are the
  primary basis for gaining new knowledge about
  the world.

• Four stages of cognitive development can be
  identified.


We examine each of these points in more detail in 
the following. 

By studying the development of conservation, 
Piaget concluded that a child’s construction of the 
world is fundamentally different from the adult per- 
spective. Conservation refers to the awareness that 
physical quantities do not change in amount when 
they are superficially altered in appearance. For ex- 
ample, most adults know that two matching rows of 
10 pennies are still equivalent if one row is spread 
out—adults possess conservation of number. But a 
young child will be easily misled by the superficial 
change in appearance and may insist that the sec- 
ond row now has more pennies. In a similar man- 
ner, it can be shown that young children do not 
possess conservation of continuous quantity, sub- 
stance, weight, or volume. 

In order to explain how infants and children 
gain new knowledge about the world, Piaget sug- 
gested that they form psychological structures 
called schemas. A schema is an organized pat- 
tern of behavior or a well-defined mental struc- 
ture that leads to knowing how to do something. 
Perhaps a few examples will help clarify this diffi- 
cult concept. Young infants possess schemas that 
are mainly sensorimotor in nature, such as the 
grasp-and-pull schema that allows a baby to re- 
trieve a desired object and bring it up to the mouth. 
As we get older, we add mental structures to our 
collection of sensorimotor schemas. For example, 
teenagers usually possess the alphabetizing schema 
that permits them to find a word in a dictionary 
by repeatedly applying the simple rule that entries 
are alphabetical by first letter, then second letter, 
and so on. 

Piaget’s genius was in suggesting a mechanism 
by which schemas evolve toward greater and 
greater levels of complexity, thereby transforming 
into the more mature level of intellectual skill ob- 
served in most adults. The mechanism by which 
schemas become more mature is called the process 
of equilibration. To understand equilibration, the 
reader needs to know three additional Piagetian
concepts: assimilation, accommodation, and equilibrium.

Assimilation is the application of a schema to 
an object, person, or event. For example, assimila- 
tion is involved when an infant uses the grasp-and- 
pull schema to retrieve a baby rattle and bring it to 
the mouth. If assimilation works to achieve the de- 
sired goals of the person, a state of harmony or equi- 
librium exists. But what happens if the application 
of the schema doesn’t work? Suppose the grasp- 
and-pull schema is unsuccessful because the baby 
rattle snags on the vertical side bars of the crib as 
the infant seeks to bring the toy to the mouth. A state 
of dynamic tension will then arise, requiring the in- 
fant to adjust the schema so that it works. The ad- 
justment of an unsuccessful schema so that it works 
is called accommodation. In our example of the in- 
fant using the grasp-and-pull schema to retrieve a 
baby rattle, the schema might be modified and be- 
come the grasp-and-pull-and-turn schema. If the 
modified schema is successful and allows the infant 
to bring the rattle to the mouth, a state of equilib- 
rium exists once again. Note the distinction be- 
tween equilibrium, the state of temporary harmony, 
and equilibration, the entire process of assimilation, 
accommodation, and equilibrium. Piaget believed 
that the striving toward equilibrium was an inher- 
ited characteristic of the human species. 

Piaget also proposed four stages of cognitive 
development. According to his view, each stage is 
qualitatively different from the others and charac- 
terized by distinctive patterns of thought (Table 
5.2). In the next topic (5B, Assessment of Infant 
and Preschool Abilities), we discuss an infant test 
based on a Piagetian analysis of cognitive develop- 
ment. In general, tests based upon these concepts 
seek to ascertain whether a child has passed certain 
cognitive milestones (e.g., conservation of volume) 
proposed by Piaget. 


TABLE 5.2 Piaget’s Stages of Cognitive Development

Stage and Age Span       Characteristics of Thought

Sensorimotor:            Infants experience the world mainly through their
birth to 2 years         senses and motor abilities, act as if an object
                         ceases to exist if it is not in sight, but develop
                         object permanence by the end of this stage.

Preoperational:          Conservation concepts not yet developed, but these
2 to 6 years             children do understand the idea of a functional
                         relationship (for example, you pull on a cord to
                         open a curtain, and the farther you pull, the more
                         the curtain opens). Ability to mentally symbolize
                         things with words and images also develops.

Concrete Operational:    Children typically develop conservation and
7 to 12 years            demonstrate limited capacities of logical
                         reasoning. For example, concept of reversibility
                         develops: the knowledge that one action can
                         reverse or negate another.

Formal Operational:      The systematic problem solving that we associate
12 years and up          with adult thought usually develops in this stage.
                         There is a greater capacity to generate hypotheses
                         and test them.


Guilford and the Structure-of-Intellect Model


After World War II, J. P. Guilford (1967, 1985) continued
the search for the factors of intelligence that
had been initiated by Thurstone. Guilford soon
concluded that the number of discernible mental
abilities was far in excess of the seven proposed by
Thurstone. For one thing, Thurstone had ignored 
the category of creative thinking entirely, an un- 
warranted oversight in Guilford’s view. Guilford 
also found that if innovative types of tests were included
in the large batteries of tests he administered
to his subjects, then the pattern of correlations between
these tests indicated the existence of literally
dozens of new factors of intellect. Furthermore, 
Guilford noticed that some of these new factors had 
recurring similarities with respect to the kinds of 
mental processes involved, the kinds of information 
featured, or the form that the items of information 
took. As a result of these recurring similarities in 
the newly discovered factors of intellect, he became 
convinced that these multitudinous factors could be 
grouped along a small number of main dimensions. 
Guilford (1967) proposed an elegant structure-of- 
intellect (SOI) model to summarize his findings. 
Visually conceived, Guilford’s SOI model classi- 
fies intellectual abilities along three dimensions 
called operations, contents, and products. 

By operations, Guilford has in mind the kind of 
intellectual operation required by the test. Most test 
items emphasize just one of the operations listed 
here: 


Cognition              Discovering, knowing, or comprehending

Memory                 Committing items of information to memory,
                       such as a series of numbers

Divergent production   Retrieving from memory items of a specific
                       class, such as naming objects that are both
                       hard and edible

Convergent production  Retrieving from memory a correct item, such
                       as a crossword puzzle word

Evaluation             Determining how well a certain item of
                       information satisfies specific logical
                       requirements


Contents refers to the nature of the materials or 
information presented to the examinee. The five 
content categories are as follows: 


Visual       Images presented to the eyes

Auditory     Sounds presented to the ears

Symbolic     Such as mathematical symbols that stand for
             something

Semantic     Meanings, usually of word symbols

Behavioral   The ability to comprehend the mental state and
             behavior of other persons

The third dimension in Guilford’s model, products,
refers to the different kinds of mental structures
that the brain must produce to derive a correct
answer. The six kinds of products are as follows: 


Unit             A single entity having a unique combination of
                 properties or attributes

Class            What it is that similar units have in common,
                 such as a set of triangles or high-pitched tones

Relation         An observed connection between two items, such
                 as two tones an octave apart

System           Three or more items forming a recognizable
                 whole, such as a melody or a plan for a sequence
                 of actions

Transformation   A change in an item of information, such as a
                 correction of a misspelling

Implication      What an individual item implies, such as to
                 expect thunder following lightning


In total, then, Guilford (1985) identified five 
types of operations, five types of content, and six 
types of products, for a total of 5 x 5 x 6 or 150 fac- 
tors of intellect. Each combination of an operation 
(e.g., memory), a content (e.g., symbolic), and a 
product (e.g., units) represents a different factor of 
intellect. Guilford claims to have verified over 100 
of these factors in his research. 
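
The arithmetic behind the 150 cells can be checked by enumerating every combination of the three dimensions. The following is a minimal sketch: the category names are taken from the lists above, while the variable names (`OPERATIONS`, `CONTENTS`, `PRODUCTS`, `soi_factors`) are illustrative and are not part of Guilford's model.

```python
from itertools import product

# Category labels as listed in the text for the three SOI dimensions.
OPERATIONS = ["cognition", "memory", "divergent production",
              "convergent production", "evaluation"]
CONTENTS = ["visual", "auditory", "symbolic", "semantic", "behavioral"]
PRODUCTS = ["unit", "class", "relation", "system",
            "transformation", "implication"]

# Each (operation, content, product) triple is one cell of the cube,
# that is, one hypothesized factor of intellect.
soi_factors = list(product(OPERATIONS, CONTENTS, PRODUCTS))

print(len(soi_factors))                               # 150
print(("memory", "symbolic", "unit") in soi_factors)  # True
```

The triple ("memory", "symbolic", "unit") corresponds to the example in the text, memory for symbolic units.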

The SOI model is often lauded on the grounds 
that it captures the complexities of intelligence. 
However, this is also a potential Achilles’ heel for 
the theory. Consider one factor of intellect, mem- 
ory for symbolic units. A test that requires the ex- 
aminee to recall a series of spoken digits (e.g., Digit 
Span on the WAIS-III) might capture this factor of 
intellect quite well. But so might a visual digit span 
test and perhaps even an analogous test with tactile 
presentation of symbols, such as vibrating rods ap- 
plied to the skin. Perhaps we need a separate cube 
for hearing, vision, and touch; such an expanded 
model would incorporate 450 factors of intellect, 
surely an unwieldy number. 




Although it seems doubtful that intelligence 
could involve such a large number of unique abili- 
ties, Guilford’s atomistic view of intellect nonethe- 
less has caused test developers to rethink and widen 
their understanding of intelligence. Prior to Guil- 
ford’s contributions, most tests of intelligence 
required mainly convergent production—the con- 
struction of a single correct answer to a stimulus 
situation. Guilford raised the intriguing possibility 
that divergent production—the creation of nu- 
merous appropriate responses to a single stimulus 
situation—is also an essential element of intelligent 
behavior. Thus, a question such as “List as many 
consequences as possible if clouds had strings 
hanging down from them” (divergent production) 
might assess an aspect of intelligence not measured 
by traditional tests. 


Theory of Simultaneous 
and Successive Processing 


Some modern conceptions of intelligence owe a debt 
to the neuropsychological investigations of the Russ- 
ian psychologist Aleksandr Luria (1902-1977). 
Luria (1966) relied primarily upon individual case 
studies and clinical observations of brain-injured sol- 
diers to arrive at a general theory of cognitive pro- 
cessing. The heart of his theory is as follows: 


Analysis shows that there is strong evidence for 
distinguishing two basic forms of integrative activ- 
ity of the cerebral cortex by which different aspects 
of the outside world may be reflected. . . . The first 
of these forms is the integration of the individual 
stimuli arriving in the brain into simultaneous, and 
primarily spatial groups, and the second is the inte- 
gration of individual stimuli arriving consecutively 
in the brain into temporally organized, successive 
series. (Luria, 1966) 


Since this approach focuses upon the mechanics by 
which information is processed, it is often called an 
information-processing theory. 

Simultaneous processing of information is 
characterized by the execution of several different 
mental operations simultaneously. Forms of think- 
ing and perception that require spatial analysis, 
such as drawing a cube, require simultaneous 
information processing. In drawing, the examinee
must simultaneously apprehend the overall shape 
and guide hand and fingers in the execution of the 
shape. A sequential approach to drawing a cube (if 
one were even possible) would be horrifically com- 
plex. In effect, the examinee would have to draw 
individual lines of highly specific lengths and an- 
gular orientations, and just hope that everything 
would line up. In the absence of a simultaneous 
mental gestalt to guide the drawing, a distorted pro- 
duction is almost guaranteed. Luria discovered that 
simultaneous processing is associated with the oc- 
cipital and parietal lobes in the back of the brain. 

Successive processing of information is 
needed for mental activities in which a proper se- 
quence of operations must be followed. This is in 
sharp contrast to simultaneous processing (such as 
drawing), for which sequence is unimportant. Suc- 
cessive processing is needed in remembering a se- 
ries of digits, repeating a string of words (e.g., shoe, 
ball, egg), and imitating a series of hand move- 
ments (fist, palm, fist, fist, palm). Luria localized 
successive processing to the temporal lobe and the 
frontal regions adjacent to it. 

Most forms of information processing require 
an interplay of simultaneous and successive mech- 
anisms. Das (1994) cites the example of reading an 
unfamiliar word such as taciturn: 


The single letters are to be recognized, and that in- 
volves simultaneous coding. The reader matches 
the visual shape of the letter with a mental dictio- 
nary and comes up with a name for it. The letter se- 
quences, then, have to be formed (successive 
coding) and blended together as a syllable (simul- 
taneous). Then the string of syllables has to be 
made into a word (successive), the word is recog- 
nized (simultaneous), and a pronunciation program 
is then assembled (successive), leading to oral 
reading (successive and simultaneous). 


Das admits that this may be a simplified view of what
occurs when a reader is confronted with a word. The 
essential point is that higher-level information pro- 
cessing relies upon an interplay of specific, anatom- 
ically localizable forms of information processing. 
The challenge of a simultaneous-successive ap- 
proach to the assessment of intelligence is to design 
tasks that tap relatively pure forms of each approach
to information processing. Tests that use
this strategy are the Kaufman Assessment Battery 
for Children (K-ABC), discussed in the next topic, 
and the Das-Naglieri Cognitive Assessment System 
(Das & Naglieri, 1993). The Das-Naglieri battery 
includes successive tasks that involve rapid articu- 
lation (such as, “Say can, ball, hot as fast as you 
can 10 times”) and simultaneous measures of both 
verbal and nonverbal tasks. The battery also as- 
sesses planning and attention, which leads to the 
acronym PASS (planning, attention, simultaneous, 
successive) (Das, Naglieri, & Kirby, 1994). 


Information-Processing Theories 
of Intelligence 


Information-processing conceptions of intelligence 
propose models of how individuals mentally repre- 
sent and process information. Borrowing from 
Campione and Brown (1978), Borkowski (1985) 
has put forward a comprehensive theory that bears 
a loose analogy to the functioning of a computer. 
The architectural system (hardware) refers to bi- 
ologically based properties necessary for informa- 
tion processing, such as memory span and speed of 
encoding/decoding information. Properties of the 
architectural system include capacity (e.g., number 
of slots in short-term memory, capacity of long- 
term memory), durability (rate of information loss), 
and efficiency of operation (e.g., rate of memory 
search). The architectural system is considered to 
be relatively “hard-wired” and impervious to 
change by the environment. 

In addition to the structural component of intel- 
ligence, there are various functional components 
(software). The executive system, which refers to 
environmentally learned components that steer 
problem solving, provides overall guidance to the 
functional components. Elements of the executive 
system include the knowledge base (retrieval of 
knowledge from long-term memory), schemes 
(rules of thinking), control processes (rules and 
strategies such as self-checking and rehearsal), and 
metacognition (self-awareness of one’s own thought 
processes). Metacognition is the process of think- 
ing about thinking. Flavell (1976), who pioneered 
research on this topic, explained it as follows: 


Metacognition refers to one’s knowledge concerning
one’s own cognitive processes or anything related to 
them, e.g., the learning-relevant properties of infor- 
mation or data. For example, I am engaging in 
metacognition if I notice that I am having more trou- 
ble learning A than B; if it strikes me that I should 
double check C before accepting it as fact. (p. 232) 


The information-processing approach to intelli- 
gence has generated a large body of research, espe- 
cially on the concept of metacognition. A consistent 
finding in this literature is that individuals who use 
metacognitive strategies perform at much higher 
levels than those who do not (Montague & Bos, 
1990). For example, in a study of 32 Israeli kinder- 
garten children who were taught metacognition re- 
lated to mathematics, metacognitive skills explained 
more of the variance in mathematics performance 
than general ability (Mevarech, 1995). Metacogni- 
tion is essential to intelligence and is one of the pri- 
mary influences on student learning (Wang, Haertel, 
& Walberg, 1990). 


Intelligence as a Biological Construct 


Most investigators have studied intelligence in the 
traditional manner by developing tests of intellect 
and correlating scores with external criteria (e.g., 
school grades) or other test results. But a few re- 
searchers have sought to discern the nature of in- 
telligence by looking at the properties of the brain 
itself. For example, Hynd and Willis (1985) provide 
an excellent survey of the neurological foundations 
of intelligence. 

One important property of the brain required 
for intelligent behavior is the well-patterned and 
synchronized electrical activity of brain cells. 
Neurons must transmit precisely calibrated elec- 
trochemical impulses in order for sensation, per- 
ception, and higher thought processes to occur. The 
collective electrical activity of brain cells can be 
measured by placing electrodes on a person’s scalp. 
The ongoing record of electrical activity shows 
spontaneous fluctuations over time but also demon- 
strates predictable patternings in response to cer- 
tain stimuli. For example, an evoked potential can 
be measured by noting the pattern of brain waves 
that occurs in the quarter second or so after a light
is flashed in a subject’s eyes. An average evoked
potential (AEP) is usually obtained from hundreds 
of such trials for a single individual. In this manner, 
an extremely consistent and distinctive pattern can 
be obtained for any individual. 
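
The reason averaging yields a consistent pattern can be shown with a small simulation. This is a sketch under stated assumptions, not a description of any laboratory's procedure: the "true" response, the noise level, and the number of trials are all invented values, and `average_evoked_potential` is a hypothetical helper name. The stimulus-locked response is assumed to be the same on every trial, while the spontaneous background activity is treated as random noise that cancels out in the mean.

```python
import random

def average_evoked_potential(trials):
    """Point-by-point mean across trials that all have the same length."""
    n_trials = len(trials)
    n_samples = len(trials[0])
    return [sum(trial[i] for trial in trials) / n_trials
            for i in range(n_samples)]

# Hypothetical stimulus-locked response (arbitrary units, 7 time points).
true_response = [0.0, 2.0, -1.0, 3.0, -2.0, 1.0, 0.0]

# Each simulated trial is the same response buried in random noise.
rng = random.Random(0)
trials = [[v + rng.gauss(0, 2.0) for v in true_response]
          for _ in range(400)]

aep = average_evoked_potential(trials)
print([round(x, 1) for x in aep])
```

Any single simulated trial is dominated by noise, but the average across several hundred trials lies close to the assumed underlying response.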

Ertl and Schafer (1969) were among the first re- 
searchers to study the brain wave correlates of in- 
telligence. They discovered that the waveform of the 
AEP has many more peaks and troughs for high-IQ 
subjects than for low-IQ subjects. Eysenck (1982) 
published similar findings, which we have reproduced
here (Figure 5.4). Two colleagues of Eysenck,
A. E. Hendrickson (1982) and D. E. Hendrickson
(1982), noticed that the total length of the sinuous
waveform of the AEP could be used as a biological
index of intelligence. They laid a piece of string over
each of the AEP waveforms reported by Ertl and
Schafer (1969). The beginnings and ends of the
strings were cut, the strings were tightly stretched
into straight lines, then measured for length. The researchers
were then able to compute the correlation
between the string lengths and the published IQ
scores. The result was an impressive value of r = .77.
This correlation is as high as those reported between 
any two psychometric tests of intelligence. A purely 
biological measure of brain function (AEP waves) 
turns out to be an excellent predictor of intelligence 
as measured by traditional IQ tests. 
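
The string measure amounts to the total length of the plotted waveform, which can then be correlated with IQ. The sketch below is a digital analogue under stated assumptions: it treats a waveform as a series of (time, amplitude) points joined by straight segments, and the function names `waveform_length` and `pearson_r` are hypothetical. The resulting length depends on how the two axes are scaled, a detail the account above does not specify.

```python
import math

def waveform_length(times, amplitudes):
    """Total length of the straight-line path through (time, amplitude) points."""
    points = list(zip(times, amplitudes))
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Two made-up waveforms over the same time span: a flat trace and a
# trace with one peak. More peaks and troughs give a longer "string".
flat = waveform_length([0, 1, 2], [0, 0, 0])
peaked = waveform_length([0, 1, 2], [0, 1, 0])
print(flat, peaked)
```

Given one string length and one IQ score per subject, `pearson_r(lengths, iq_scores)` returns the kind of coefficient reported in the text; no actual data from the original studies are reproduced here.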

In spite of these promising research findings, 
several investigators remain skeptical about the 
electrocortical correlates of intelligence. The corre- 
lations arise only under certain conditions, and at- 
tempts to replicate the results do not always succeed 
(Eysenck, 1994; Vernon & Mori, 1990). Gale and 
Edwards (1983) argue that mere correlational stud- 
ies are not enough; we need a more theory-bound 
orientation that links intelligence as a trait with 
information processing at the neural level. Efforts
to formulate such a theory have been made
(Deary, Hendrickson, & Burns, 1987). These and 
similar studies (e.g., Shucard & Horn, 1972) serve 
as a reminder that intelligence is somehow bound 
up in the physiological properties of the brain, even
though we don’t yet understand the precise biolog- 
ical characteristics that account for intelligence. 

Haier and his colleagues have pursued a differ- 
ent path in their study of biological intelligence 
(Haier, Nuechterlein, Hazlett, and others, 1988;
Haier, Siegel, Tang, and others, 1992). They mea- 
sured cortical glucose metabolic rates as revealed 
by positron emission tomography (PET) scan 
analysis of volunteers solving intellectual prob- 
lems. Brain cells use glucose and oxygen for fuel, 
so a PET scan will reveal “hot spots” at the most 
active brain sites (where glucose is being metabo- 
lized). Intriguingly, more-intelligent persons 
showed less brain activity when solving geometric 
analogy problems and when playing the Tetris 
computer game than less-intelligent persons. What 
remains unclear in this line of research is the causal 
direction: Are people smart because they use less 
glucose or do they use less glucose because they 
are smart? Another possibility is that both high 
IQ and low glucose metabolism are related to 
a third causal variable (Sternberg & Kaufman, 
1998). 


Gardner and the Theory 
of Multiple Intelligences 


Howard Gardner (1983, 1993) has proposed a the- 
ory of multiple intelligences based loosely upon the 
study of brain-behavior relationships. He argues 
for the existence of several relatively independent
human intelligences, although he admits that the
exact nature, extent, and number of the intelligences
have not yet been definitively established.
Gardner (1983) outlines the criteria for an au- 
tonomous intelligence as follows: 


• Potential isolation by brain damage: the faculty
  can be destroyed, or spared in isolation, by brain
  injury.

• Existence of exceptional individuals such as
  savants: the faculty is uniquely spared in the midst
  of general intellectual mediocrity.

• Identifiable core operations: the faculty relies
  upon one or more basic information-processing
  operations.

• Distinctive developmental history: the faculty
  possesses an identifiable developmental history,
  perhaps including critical periods and milestones.

• Evolutionary plausibility: admittedly speculative,
  a faculty should have evolutionary antecedents
  shared with other organisms (e.g.,
  primate social organization).

• Support from experimental psychology: the
  faculty emerges in laboratory studies in cognitive
  psychology.

• Support from psychometric findings: the faculty
  reveals itself in measurement studies and is
  susceptible to psychometric measurement.

• Susceptibility to symbol encoding: the faculty
  can be communicated via symbols including
  (but not limited to) language, picturing, and
  mathematics.


Based upon these criteria, Gardner (1983, 1993) 
proposes that the following seven natural intelli- 
gences have been substantially confirmed. The seven 
intelligences are linguistic, logical-mathematical, 
spatial, musical, bodily-kinesthetic, interpersonal, 
and intrapersonal. Three of these seven types of in- 
telligence are well known—linguistic (i.e., verbal) 
intelligence, logical-mathematical intelligence, spa- 
tial intelligence—and numerous formal tests have 
been devised to measure them, so we will not discuss 
them further here. The other four variations of intel- 
ligence are somewhat novel and therefore require 
more detailed presentation. 

Bodily-kinesthetic intelligence includes the 
types of skills used by athletes, dancers, mime 
artists, typists, or “primitive” hunters. Although 
Western cultures are generally loath to consider the 
body as a form of intelligence, this is not the case 
in much of the rest of the world, nor was it true in 
our evolutionary history. Indeed, persons who could 
skillfully avoid predators, climb trees, hunt animals, 
and prepare tools were more likely to survive and 
pass on their genes to succeeding generations. 

The personal intelligences include the capacity 
to have access to one’s own feeling life (intraper- 
sonal) as well as the ability to notice and make 
distinctions about the moods, temperaments, moti- 
vations, and intentions of others (interpersonal). 
Thus, personal intelligence encompasses both an in- 
trapersonal and an interpersonal version. The former 
is found in great novelists who can write introspec- 
tively about their feelings, while the latter is often 
seen in religious and political leaders (e.g., Mahatma 
Gandhi or Lyndon Johnson) who can fathom the intentions
and desires of others and use this information
to influence them and form useful alliances.

Musical intelligence is perhaps the least under- 
stood of Gardner’s intelligences. Persons with good 
musical intelligence easily learn to play an instrument
or to write their own compositions. Although
knowledge of the structural aspects of
melody, rhythm, and timbre is important to musi- 
cal intelligence, Gardner notes that many experts 
place the affective or feeling aspects of music at its 
core. He believes that when the neurological un- 
derpinnings of music are finally unraveled, we will 
have “an explanation of how emotional and moti- 
vational factors are intertwined with purely per- 
ceptual ones” (Gardner, 1983). 

The savant phenomenon provides strong sup- 
port for the existence of separate intelligences, 
including musical intelligence. A savant is a men- 
tally deficient individual who has a highly devel- 
oped talent in a single area such as art, rapid 
calculation, memory, or music. An example is the 
extraordinary case of Leslie Lemke, who was born 
blind, and with mental retardation and cerebral 
palsy. He was not supposed to live. His adoptive 
mother had to coax him to suck milk from a bottle. 
Later, she strapped him to her back to help him 
learn to walk. In spite of his severe disabilities, 
Leslie became enamored of the piano and showed 
incredible precocity at picking out melodies on it. 
Within a few years, at the age of 18, he could listen 
to a piece of classical piano music a single time and 
then play it back flawlessly (Patton, Payne, & 
Beirne-Smith, 1986). The reader can find addi- 
tional savant case studies in Miller (1989) and 
Treffert (1989). 

Recently, Gardner (1998) has added three ten- 
tative candidates to his list of intelligences. These 
are naturalistic, spiritual, and existential intelli- 
gences. Naturalistic intelligence is the kind shown 
by people who are able to discern patterns in na- 
ture. Charles Darwin would be a prime example of 
such a person. Gardner believes that the evidence 
for this kind of intelligence is relatively strong. In 
contrast, spiritual intelligence (a concern with cos- 
mic and spiritual issues in one’s development) and 
existential intelligence (a concern with ultimate issues,
including the meaning of life) are less well
proved as independent intelligences. In general, the
theory of multiple intelligences is compelling in its
simplicity, but there is little empirical investigation
of its validity.

2. Historically, savants have also been called idiot savants,
which refers, literally, to a person who is both profoundly retarded
and yet “wise” at the same time. For obvious reasons, the
prefix has been dropped.


Sternberg and the Triarchic 
Theory of Intelligence 


Sternberg (1985b, 1986, 1996) takes a much wider 
view on the nature of intelligence than most previ- 
ous theorists. In addition to proposing that certain 
mental mechanisms are required for intelligent be- 
havior, he also emphasizes that intelligence in- 
volves adaptation to the real-world environment. 
His theory emphasizes what he calls successful in- 
telligence or “the ability to adapt to, shape, and se- 
lect environments to accomplish one’s goals and 
those of one’s society and culture” (Sternberg & 
Kaufman, 1998, p. 494).

Sternberg’s theory is called triarchic (ruled by 
three) because it deals with three aspects of intelli- 
gence: componential intelligence, experiential intel- 
ligence, and contextual intelligence. Each of these 
types of intelligence has two or more subcompo- 
nents. The entire theory is outlined in Table 5.3. 


TABLE 5.3 An Outline of Sternberg’s Triarchic Theory of Intelligence

Componential Intelligence
    Metacomponents or executive processes (e.g., planning)
    Performance components (e.g., syllogistic reasoning)
    Knowledge-acquisition components (e.g., ability to acquire vocabulary words)
Experiential Intelligence
    Ability to deal with novelty
    Ability to automatize information processing
Contextual Intelligence
    Adaptation to real-world environment
    Selection of a suitable environment
    Shaping of the environment

Source: Summarized from Sternberg, R. J. (1986). Intelligence
applied: Understanding and increasing your intellectual skills.
San Diego, CA: Harcourt Brace Jovanovich.

Componential intelligence consists of the internal
mental mechanisms that are responsible for
intelligent behavior. The components of intelligence
serve three different functions. Metacomponents
are the executive processes that direct the
activities of all the other components of intelligence.
They are responsible for determining the nature
of an intellectual problem, selecting a strategy
for solving it, and making sure that the task is completed.
The metacomponents receive constant feedback
as to how things are going in problem solving.
Persons who are strong on the metacomponential 
aspect of intelligence are very good at allocating 
their intellectual resources. 

In a problem-solving study using novel forms 
of analogies, Sternberg (1981) found that higher in- 
telligence is associated with spending relatively 
more time on global or higher-order planning, and 
relatively less time on local or lower-order plan- 
ning. For example, consider this analogy problem: 


Man : Skin :: (Dog, Tree) : (Bark, Cat)


The examinee must choose the two correct terms 
on the right that will complete the analogy. (The 
correct choices are Tree and Bark). Using reaction 
time measures for a series of such novel or nonen- 
trenched problems, Sternberg (1981) found that 
persons of higher intelligence spent more time in
global planning—forming a macrostrategy that ap- 
plies to this and similar problems—than did per- 
sons of lower intelligence. Thus, a crucial aspect of 
intelligence is knowing when to step back and allocate
intellectual effort instead of obtusely attacking
a difficult problem.

Performance components are the well- 
entrenched mental processes that might be used to 
perform a task or solve a problem. These aspects of 
intelligence are the ones that are probably measured 
the best by existing intelligence tests. Examples of 
performance components include short-term mem- 
ory and syllogistic reasoning. 
Knowledge-acquisition components are the 
processes used in learning. Sternberg has empha- 
sized that in order to understand what makes some 
people more skilled than others, we must understand
their increased capacity to acquire those skills
in the first place. A case in point is vocabulary
knowledge, which is learned mainly in context 
rather than through direct instruction. More- 
intelligent persons are better able to use surround- 
ing contexts to figure out what a word means; that 
is, they have greater knowledge-acquisition skills. 
Their increased vocabulary results, in large mea- 
sure, from their increased ability to “soak up” the 
meanings of words they see and hear in their envi- 
ronment. Thus, vocabulary is an excellent measure 
of intelligence because it reflects people’s ability to 
acquire information in context. 

The second aspect of Sternberg’s theory in- 
volves experiential intelligence. According to the 
theory, a person with good experiential intelligence 
is able to deal effectively with novel tasks. This as- 
pect of his theory explains why Sternberg is so crit- 
ical of most intelligence tests. For the most part, the 
existing tests measure things already learned by pre- 
senting tasks that the subject has already encoun- 
tered. According to Sternberg, intelligence also 
involves the capacity to learn and think within new 
conceptual systems, not just to deal with tasks al- 
ready encountered. A second aspect of experiential 
intelligence is the ability to automatize or “make 
routine” tasks that are encountered repeatedly. An
example of automatizing that applies to most of us 
is reading, which is carried out largely without con- 
scious thought. But any task or mental skill can be 
automatized, if it is practiced enough. Playing music 
is an example of an extremely high-level skill that 
can become automatized with enough practice. 

The third aspect of Sternberg’s theory involves 
contextual intelligence. Contextual intelligence is 
defined as “mental activity involved in purposive 
adaptation to, shaping of, and selection of real- 
world environments relevant to one’s life” (Stern- 
berg, 1986, p. 33). This aspect of Sternberg’s theory 
appears to acknowledge that human behavior has 
been shaped by selective pressures during our evo- 
lutionary history. Contextual intelligence has three 
parts: adaptation, selection, and shaping. 

Adaptation refers to developing skills required 
by one’s particular environment. Successful adap- 
tation will differ from one culture to the next. In the 
pygmy cultures of Africa, adaptation might involve
the ability to track elephants and kill them with
poison-tipped spears. In the Western industrial na- 
tions, adaptation might involve presenting oneself 
favorably in a job interview. 

Selection might be called niche finding. This as- 
pect of contextual intelligence involves the ability 
to leave the environment we are in and to select a 
different environment more suitable to our talents 
and needs. Feldman (1982) has illustrated how se- 
lection can operate in the career choices of gifted 
children, thereby determining whether they are 
highly accomplished as adults. She followed up on 
the Quiz Kids who were featured in radio and tele- 
vision shows of the 1950s. These were extremely 
bright children by conventional standards, most 
with IQs of 140 and higher. A few became highly 
successful as adults. However, most of them led 
rather ordinary lives, devoid of the spectacular ac- 
complishments that might have been predicted from 
their childhood precocity. Those who were most 
successful had found occupations highly suited to 
their abilities and interests. In sum, they had se- 
lected environmental niches that fitted them well. 
Sternberg would argue that the ability to select such 
environments is an important aspect of intelligence. 

Shaping is another way to improve the fit 
between oneself and the environment, especially 
when selection of a new environment is not practi- 
cal. In this application of contextual intelligence, 
we shape the environment itself so that it better fits 
our needs. An employee who convinces the boss to 
do things differently has used shaping to make the 
work environment more suited to his or her talents. 

Although Sternberg’s triarchic theory is the most 
comprehensive and ambitious model yet proposed, 
not all psychometric researchers have rushed to em- 
brace it. Detterman (1984) cautions that we should 
investigate the basic cognitive components of intel- 
ligence before introducing higher-order constructs 
that may be unnecessary. Rogoff (1984) questions 
whether the three subtheories (componential, expe- 
riential, contextual) are sufficiently linked. Other 
comments on the triarchic theory can be found in Be- 
havioral and Brain Sciences (1984, pp. 287-304). 

Whatever the final verdict on the triarchic 
theory of intelligence, Sternberg’s insistence that 
intelligence has several components not measured by
traditional tests rings true to anyone who has studied 
or administered these tests. He cites the case of a col- 
league who was asked to test a number of residents 
at an institution for those with mental retardation. 
These residents had just planned and successfully executed
an escape from the security-conscious school,
a feat requiring high levels of practical intelligence.
Yet, when administered the Porteus Maze Test (Por- 
teus, 1965), a standardized test reputed to involve 
planning ability, they could not solve even the sim- 
plest maze correctly. Sternberg (1986) has made it 
clear that intelligence just has too many components 
to be measured by any single test. 


SUMMARY 


1. In spite of symposia and scholarly analysis, 
the concept of “intelligence” has eluded consensual 
definition. Yet, two themes recur with some fre- 
quency in expert definitions of intelligence. Ac- 
cording to the scholars, intelligence encompasses 
(1) the capacity to learn from experience, and 
(2) the capacity to adapt to one’s environment. 


2. Lay and expert conceptions of intelligence 
are very similar. In order of importance, laypersons 
regard practical problem-solving ability, verbal 
ability, and social competence as the key ingredi- 
ents; experts see verbal intelligence, problem- 
solving ability, and practical intelligence as crucial. 


3. The first theories of intelligence, proposed 
in the late 1800s, emphasized sensory acuity. Sir 
Francis Galton and J. McKeen Cattell both be- 
lieved that intelligence was underwritten by keen 
sensory abilities. They developed several sensory 
measures in unsuccessful attempts to measure 
intelligence. 


4. In the early 1900s, Charles Spearman pro- 
posed that intelligence consisted of two kinds of 
factors: a single general factor g and numerous specific
factors s₁, s₂, s₃, and so on. He helped invent
factor analysis to aid his investigations into the 
nature of intelligence. 


5. L. L. Thurstone favored the view that intelli- 
gence consists of approximately seven group factors 
rather than a single general factor. These factors 
were verbal comprehension, word fluency, number, 
space, associative memory, perceptual speed, and in- 
ductive reasoning. Ultimately, Thurstone acknowl- 
edged the existence of g as a higher-order factor. 


6. Raymond Cattell proposed that intelligence 
consists of two major factors, fluid intelligence (gf)
and crystallized intelligence (gc). Fluid intelligence
is a largely nonverbal and relatively culture-reduced
form of mental efficiency. Crystallized intelligence 
is highly culturally dependent and is used for tasks 
that require a learned or habitual response. 


7. Jean Piaget proposed a developmental 
theme in his theory of intelligence. He suggested that 
schemas—organized patterns of behavior or mental 
structures that lead to knowing how to do some- 
thing—evolve toward greater and greater maturity 
through a process called equilibration. 

8. In Piaget’s theory, assimilation is the appli- 
cation of a schema to an object, person, or event. If a 
schema works, a state of equilibrium arises; if not, 
the result is disequilibrium—a state of dynamic ten- 
sion. In the latter case, the person must adjust the 
schema so that it works—a process called accommo- 
dation. 


9. J. P. Guilford proposed a structure-of-intellect
(SOI) model to summarize his views on the
multifaceted nature of intelligence. He classified in- 
tellectual abilities along three dimensions called op- 
erations (5 kinds), contents (5 kinds), and products 
(6 kinds). Thus, in all, Guilford proposed 150 differ- 
ent kinds of intelligence. 


10. According to the theory of simultaneous and 
successive processing, the human brain has two dis- 
tinct forms of information processing: simultaneous, 
in which primarily spatial groups of information are 
processed all at once, and successive, in which infor- 
mation is temporally organized in a linear series. 

11. Information-processing conceptions of intelligence
are based on a loose analogy with the
functioning of a computer. An architectural system 
(hardware), which is relatively “hard wired” and 
impervious to change by the environment, operates 
in conjunction with the functional components 
(software), which include the executive system (en- 
vironmentally learned components that steer prob- 
lem solving). 


12. A few researchers have investigated the bio- 
logical underpinnings of intelligence. For example, 
several studies indicate that psychometric intelli- 
gence correlates with aspects of brain-wave patterns. 
In some studies the complexity of an evoked brain 
wave (the average evoked potential, or AEP) corre- 
lates in the .70s with measured IQ. 


13. H. Gardner has proposed a theory of multiple 
intelligences based loosely upon the study of brain- 
behavior relationships. He argues for the existence 
of several relatively independent intelligences, in- 
cluding linguistic, musical, logical-mathematical, 
spatial, bodily-kinesthetic, and personal. 

14. R. Sternberg proposes a triarchic theory of 
intelligence with these aspects: componential intel- 
ligence (the internal mental mechanisms that are re- 
sponsible for intelligent behavior), experiential 
intelligence (the ability to deal effectively with 
novel tasks), and contextual intelligence (adap- 
tation to, shaping of, and selection of real-world 
environments). 

Assessment of Infant Ability
Assessment of Preschool Intelligence 
Practical Utility of Infant and Preschool Assessment 


Summary 


The infant and preschool period extends from
birth to roughly six years of age. The changes 
that occur during this period are obviously pro- 
found. The infant develops basic reflexes, masters 
developmental milestones (grasping, crawling, sit- 
ting, standing, and so forth), learns a language, and 
establishes the capacity for symbolic thought. For 
most children, the pattern and pace of development 
is visibly within normal limits. 

However, parents and professionals trained in 
the assessment of infants and preschoolers occa- 
sionally encounter children whose development 
seems to be slow, delayed, or even overtly retarded. 
These children elicit a flurry of anxious questions: 
How delayed is this child? What are the prospects 
for normal functioning in school? Will this child 
achieve personal independence in the adult years? 

At the opposite extreme are those precocious 
children who achieve developmental milestones 
months or years ahead of the normative schedule. 
In these cases, the proud parents have a different set 
of concerns: How advanced is my child? What are 
the strongest and weakest areas of intellectual func- 
tioning? Will this child be a gifted adult? 

Infant and preschool assessment devices can 
help answer questions about children at both ex- 
tremes of the spectrum—those who might be de- 
velopmentally delayed, and those who might be 
intellectually gifted. Of course, these tests also pro- 
vide useful information about the vast majority of 
children who fall in the middle of the distribution.
In this topic, we review the nature and application 
of prominent infant and preschool measures. These 
tools include individual tests, developmental sched- 
ules, and rating scales. We begin with a description 
of several prominent instruments and then investi- 
gate the fundamental question of purpose or utility. 
What is the use of these measures? What is the 
meaning of a score on a developmental schedule or 
preschool intelligence test? To what extent do these 
procedures allow us to prognosticate adult abilities 
or, for that matter, help us to predict early school 
performance? These questions will be more mean- 
ingful if we first review the relevant instruments. 

We divide the review into two parts: infant measures
for children from birth to age 2½, and
preschool tests for children from age 2½ to age 6.
The division is somewhat arbitrary, but not entirely
so. Infant tests tend to be multidimensional and to
load significantly on sensory and motor development.
Beginning at age 2½, standardized measures
such as the Stanford-Binet: Fourth Edition, Kauf- 
man Assessment Battery for Children, Differential 
Ability Scales, and McCarthy Scales of Children’s 
Abilities are typically used in the assessment of 
preschool children. These tests load heavily upon 
cognitive skills such as verbal comprehension and 
spatial thinking. Thus, infant scales and preschool 
tests measure somewhat different components of 
intellectual ability. 


TOPIC 5B ASSESSMENT OF INFANT AND PRESCHOOL ABILITIES


ASSESSMENT OF INFANT ABILITY

Gesell Developmental Schedules 


Designed to measure the developmental progress 
of babies and children from 4 weeks to 60 months 
of age, the Gesell Developmental Schedules were 
first introduced in 1925 and then revised peri- 
odically (Gesell, Ilg, & Ames, 1974; Knobloch, 
Stevens, & Malone, 1987). Virtually all infant tests 
have borrowed or adapted items from the original 
schedules devised by Arnold Gesell (1880-1961), 
so it is fitting and proper that we begin our review 
with this instrument. 

The Gesell Developmental Schedules provide 
a standardized procedure for observing and evalu- 
ating the developmental attainment of children in 
five areas: gross-motor, fine-motor, language de- 
velopment, adaptive behavior, and personal-social 
behaviors. Most of the 144 items in the schedule 
are purely observational, based on the direct 
inspection of the child’s responses to toys and 
standard situations. For example, here are some il- 
lustrative items typically passed by a 40-week-old 
infant: 


Adaptive
    Points at a pellet in a glass
    Pulls a string to obtain a ring
Gross-motor
    “Cruises” a rail using two hands
    Lets self down with control
Fine-motor
    Grasps a pellet promptly
    Uses “scissors” grasp on string
Language
    Uses “da da” with meaning
    Responds to “no no” word
Personal-Social
    Extends toy, no release
    Pushes arms through dress, if started


The age range of the Gesell Developmental Schedules
is birth to 60 months. The genius of Gesell
was in identifying naturally occurring situations 
in the home or clinic and in using objects or tasks
with high appeal for infants and preschoolers.
In some cases, information from a parent or care- 
taker is needed to score individual items. In spite of 
the naturalistic testing environment, well-trained 
observers can attain interexaminer reliabilities in 
the middle .90s (Knobloch, Stevens, & Malone,
1987). 

The Gesell Developmental Schedules are used 
mainly by pediatricians and other child specialists 
to identify infants and children at risk for neuro- 
logical impairment and mental retardation. Gesell 
never intended his schedule to be an intelligence 
test. He brought a strong biological orientation to 
his research and assumed that normal development 
was a maturational unfolding that occurred in a pre- 
dictable sequence. Gesell determined that normal 
development is a time-bracketed phenomenon: The 
age variability for attaining developmental mile- 
stones in infancy is generally small, on the order of 
a few weeks for many tasks. Therefore, serious 
delay in meeting his painstakingly chronicled de- 
velopmental milestones may indicate neurological 
impairment or mental retardation (Honzik, 1983; 
Lewis & Sullivan, 1985). Several studies indicate 
that the Gesell Developmental Schedules function 
well in the screening of at-risk infants (Knobloch, 
Stevens, & Malone, 1987). 

Even though the Gesell Developmental Sched- 
ules are used mainly for clinical screening and di- 
agnosis, Knobloch, Stevens, and Malone (1987) 
provide a loosely defined basis for obtaining De- 
velopmental Quotients for the five areas and over- 
all development. The formula is as follows: 


$$\text{Developmental Quotient} = \frac{\text{Maturity Age}}{\text{Chronologic Age}} \times 100$$

The Maturity Age is based on the “total clinical pic- 
ture” of developmental milestones passed and 
failed in each area. Although precise criteria are not 
provided, the Maturity Age for an infant appears to 
be the developmental age at which most items are 
passed. Since its technical properties are not well 
studied, the Developmental Quotient should be 
used mainly as a research tool. 
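
As a worked illustration of the quotient formula, a minimal Python sketch follows. The function name `developmental_quotient` and the example ages are hypothetical, chosen only for illustration. The sketch assumes that both ages are expressed in the same unit (months here), and it says nothing about how the Maturity Age itself is judged clinically.

```python
def developmental_quotient(maturity_age: float, chronologic_age: float) -> float:
    """Return (maturity age / chronologic age) x 100.

    Both ages must be expressed in the same unit (e.g., months).
    """
    if chronologic_age <= 0:
        raise ValueError("chronologic_age must be positive")
    if maturity_age < 0:
        raise ValueError("maturity_age must be non-negative")
    return maturity_age / chronologic_age * 100


# Hypothetical example: a maturity age of 9 months at a chronologic age of 12 months.
dq = developmental_quotient(9, 12)
print(f"DQ = {dq:.0f}")  # prints "DQ = 75"
```

A quotient of 100 means that the maturity age equals the chronologic age; values below 100 indicate a maturity age that lags behind the chronologic age.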

The Gesell tests are widely respected because
they provide detailed descriptions of infant devel- 
opmental milestones that are unequaled in the child 
assessment literature (Nuttall, Romero, & Kalesnik, 
1992). However, the use of the Gesell as a psycho- 
metric instrument has been sharply criticized in re- 
cent years. The basic problem appears to be a lack 
of attention to formal criteria for reliability and va- 
lidity. For example, early Gesell manuals rarely if 
ever reported test-retest reliabilities. When contem- 
porary researchers examined this property of the 
Gesell tests, the results were surprising. Lichtenstein 
(1990) reported a test-retest correlation of only .73 
with a sample of 46 children, which falls well below 
the recommended level of .90 for making decisions 
about individuals (Nunnally, 1978; Salvia & Ys- 
seldyke, 1991). Banerji (1992) concluded that the 
Gesell tests functioned poorly as a screening device 
for school readiness. In general, educational spe- 
cialists are wary of using the Gesell for decisions 
about school placement or retention. 
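
The comparison between an observed test-retest correlation and the recommended level of .90 can be sketched in a few lines of Python. This is only an illustration: the scores are made up, the function names are hypothetical, and the correlation is an ordinary Pearson coefficient, which is assumed (not confirmed by the source) to be the statistic behind the reported value.

```python
from math import sqrt

RECOMMENDED_MINIMUM = 0.90  # level cited in the text for decisions about individuals


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary in both testings")
    return cov / sqrt(var_x * var_y)


def adequate_for_individual_decisions(r, minimum=RECOMMENDED_MINIMUM):
    """True when the reliability estimate reaches the recommended minimum."""
    return r >= minimum


# Made-up scores for six children tested on two occasions.
first_testing = [10, 12, 15, 9, 14, 11]
second_testing = [11, 10, 15, 12, 13, 9]
r = pearson_r(first_testing, second_testing)
print(f"test-retest r = {r:.2f}; adequate: {adequate_for_individual_decisions(r)}")

# The value reported by Lichtenstein (1990) falls short of the benchmark.
print(adequate_for_individual_decisions(0.73))  # prints "False"
```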


Neonatal Behavioral Assessment Scale (NBAS) 


The Neonatal Behavioral Assessment Scale 
(NBAS) is unique because of its theoretical basis, 
which emphasizes the need to document the con- 
tributions of the newborn to the parent-infant sys- 
tem. The pediatrician T. Berry Brazelton (Brazelton 
& Nugent, 1995) developed this instrument to iden- 
tify and understand the “deviant” infant and to ex- 
plore the baby’s reciprocal impact on parents: 


My goal in developing the NBAS was to assess the 
baby’s contributions to the failures that resulted, 
when parents were presented with a difficult or de- 
viant infant. If we could understand the reasons be- 
hind the infant’s deviant behavior, perhaps we 
could in turn lead parents to a better understanding 
of their role. This then could lead to a more opti- 
mal outcome. (Brazelton & Nugent, 1995) 


The NBAS is suitable for infants up to two 
months of age, but is most commonly administered 
in the first week of life. The scale assesses the in- 
fant’s behavioral repertoire on 28 behavior items, 
each scored on a 9-point scale. Examples of the be- 
havior items include the following: 

• Response decrement to light
• Orientation to inanimate visual stimulus
• Cuddliness
• Consolability


In addition, the infant’s neurological status is eval- 
uated on 18 reflex items, each scored on a 4-point 
scale. Examples include the following: 


• Plantar grasp
• Babinski reflex
• Rooting reflex
• Sucking reflex


Finally, seven supplementary items can be used to 
summarize the qualities of responsiveness of frail, 
high-risk infants, including these: 


• Quality of alertness
• General irritability
• Examiner’s emotional response to infant


Brazelton and Nugent (1995) do not provide an 
integrative scoring system; that is, there are no sum- 
mary scores for the entire battery or its subcompo- 
nents. Instead, the “scoring” of the NBAS consists 
of a summary sheet with ratings on each specific 
item. In clinical work, the instrument is used to pro- 
vide feedback to parents. Specifically, Brazelton 
recommends that health care professionals demon- 
strate the NBAS in order to sensitize parents to their 
baby’s uniqueness and to promote a positive parent-infant
relationship. Regarding clinical use of the
test, Fowles (1999) compared mothers who received 
a demonstration of the NBAS with a matched con- 
trol group and showed that the intervention group 
subsequently rated their infants as significantly 
more predictable. Thus, the NBAS was found to be 
useful in helping mothers anticipate their infants’ re- 
sponses to environmental stimuli. 

For research on newborn outcomes, various in- 
vestigators have developed scoring systems for the 
NBAS, including a popular seven-cluster scoring 
method proposed by Lester (1984). This method 
provides summary scores for identified clusters 
(habituation, orientation, motor performance, 
arousal/lability, regulation, autonomic stability, and 
reflexes). Using a quantitative scoring approach, re- 
searchers have linked prenatal cocaine exposure to 
inferior performance on the NBAS (Morrow et al.,
2001; Schuler, 1999). In addition, the NBAS is also 
sensitive to the detrimental effects of polychlori- 
nated biphenyls (PCBs) on babies born to women 
who consumed contaminated Lake Ontario fish 
(Stewart, Reihman, Lonky, Darvill, & Pagano, 
1999). Brazelton and Nugent (1995) summarize 
other studies with the test. 

In spite of the proven utility of the NBAS as a 
clinical and research tool, reviewers have been 
somewhat skeptical about the psychometric prop- 
erties of the instrument. For example, Majnemer 
and Mazer (1998) point to very low test-retest reliability
coefficients (r = -0.15 to +0.32 for the individual
items) and weak interrater agreement. One
likely explanation is that in newborn infants, indi- 
vidual traits may fluctuate rapidly over short peri- 
ods of time, which would produce an underestimate 
of true reliability when the NBAS is given twice 
over a period of days or weeks. For this reason, de- 
viant scores from a single administration of the 
NBAS should not be overinterpreted. 


Ordinal Scales of 
Psychological Development 


The Ordinal Scales of Psychological Development 
(OSPD), hereafter called the Ordinal Scales, were 
designed as a Piagetian-based tool for measuring in- 
tellectual development between the ages of 2 weeks 
and 2 years (Uzgiris & Hunt, 1989). The Ordinal 
Scales consist of six scales, each designed to mea- 
sure a specific ability that arises during Piaget’s first 
stage of sensorimotor intelligence. Each scale con- 
sists of 5 to 15 separate ordinal steps; that is, the 
items are arranged in a normally invariant develop- 
mental sequence. 
The scales are as follows: 


• Visual pursuit and permanence of objects
• Development of means-ends
• Vocal and gestural imitation
• Development of operational causality
• Construction of object relations in space
• Development of schemes for relating to objects


In light of the many adversities that arise when 
testing infants—they may cry, regurgitate, crawl 
away, ignore the task, fall asleep, or fixate on the 
tester’s beard—the scales of this instrument pos- 
sess surprisingly strong psychometric properties. In 
one study of 84 infants, the Ordinal Scales showed 
excellent interobserver reliability (mean of 96 per- 
cent), good test-retest consistency, respectable or- 
dinality, and very strong correlations with age 
(Uzgiris, 1976). In short, this instrument appears to 
be a psychometrically sound index of sensorimotor 
intelligence. 
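
Ordinality can be illustrated with a simple consistency check. If the steps of a scale form an invariant developmental sequence, an infant who passes a given step should also have passed every earlier step. The Python sketch below counts departures from that pattern. It is a simplified illustration under that assumption, not the index of ordinality used by the authors of the scales; the function name and the pass/fail data are hypothetical.

```python
def ordinal_violations(passed):
    """Count failed steps that are followed by a later passed step.

    `passed` lists pass/fail results (True/False) for the steps of one scale,
    ordered from the earliest developmental step to the latest.
    """
    violations = 0
    for i, step_passed in enumerate(passed):
        # A failure is a violation only if some later step was passed.
        if not step_passed and any(passed[i + 1:]):
            violations += 1
    return violations


# Hypothetical response patterns on a five-step scale.
consistent = [True, True, True, False, False]
inconsistent = [True, False, True, False, False]
print(ordinal_violations(consistent))    # prints 0
print(ordinal_violations(inconsistent))  # prints 1
```

A scale with respectable ordinality would show few such violations across the infants tested.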

Uzgiris (1983) believes that intellectual func- 
tioning in infancy is qualitatively different and 
“needs to be understood in its own right.” The Or- 
dinal Scales were developed as a means of investi- 
gating infant intelligence within the theoretical 
framework developed by Piaget. For this reason, 
Uzgiris makes no pretense of prediction for her 
instrument. Along this line, Kahn (1992) demon- 
strated that the OSPD, administered at age 6, is a 
very weak predictor of adaptive behavior in chil- 
dren with severe and profound mental retardation 
who were retested at age 10. 

In general, correlations between scale scores 
and later IQ are very low until infants are at least 
18 months of age. Very few clinicians use the in- 
strument for developmental screening. However, 
Dunst (1980) has argued for using the Ordinal 
Scales as a basis for designing a developmentally 
sound curriculum for disabled children. More re- 
cently, Auer and Reisberg (1996) have raised the 
intriguing possibility that the Ordinal Scales can be 
used for the cognitive assessment of severe de- 
mentia in the elderly. 


Bayley Scales of Infant Development-II


After decades of prominence in the field of infant 
assessment, the Bayley Scales of Infant Develop- 
ment (BSID) have been recently revised (Bayley, 
1969, 1993). The format of the scale is the same— 
the Mental Scale and the Motor Scale provide quan- 
titative normalized standard scores with mean of 
100 and standard deviation of 16—but the Bayley- 
II covers a wider age range, extending from ages 1 
month to 42 months. The third component, the Behavior
Rating Scale, consists of 30 items designed
to assess attention, orientation, emotional regulation,
and motor quality. The Bayley-II has been
renormed on a stratified random sample of 1,700
children who closely parallel the 1988 U.S. Census
statistics on age, sex, ethnicity, region, and parental
education.

The Mental Scale measures the following 
abilities: 


• Sensory/perceptual acuities
• Acquisition of object constancy
• Memory, learning, and problem solving
• Vocalization, verbal communication
• Early evidence of abstract thinking
• Habituation
• Mental mapping
• Complex language
• Mathematical concept formation


The Motor Scale assesses the following skills: 


• Degree of bodily control
• Coordination of large muscles
• Fine motor control of hands and fingers
• Dynamic movement
• Dynamic praxis
• Postural imitation
• Stereognosis


The technical quality and excellent standard- 
ization of the Bayley Scales mark this test as the 
psychometric pinnacle of its field (Sattler, 2001). 
Although the Bayley-II has only a modest amount 
of validational research, this instrument strongly 
resembles its predecessor, for which a huge amount 
of validity evidence can be cited. Thus, the validity 
of the Bayley-II rests, in part, upon its resemblance 
to the Bayley. Regarding validity, the Bayley man- 
ual reports a correlation of .57 between the Mental 
Scale and Stanford-Binet IQ for 120 children ages 
24 to 30 months. Self and Horowitz (1979) re- 
viewed the voluminous literature on correlates of 
Bayley Scale scores. The Bayley shows strong re- 
lationships with the Stanford-Binet, Wechsler 
scales, Piagetian task performance, social class,
and environmental factors. Also, very low scores on 
the Bayley predict poor developmental outcome in 
later childhood (VanderVeer & Schweid, 1974). 
Rhodes, Bailey, and Yow (1983) cite additional val- 
idation evidence. 

A recent validational study of the BSID-II with 
premature infants found strong agreement between 
this test and the first edition, supporting the clini- 
cal validity of the revision (Goldstein, Fogle, 
Wieber, & O’Shea, 1995). Even so, Niccols and 
Latchman (2002) provide a skeptical note. In a lon- 
gitudinal study comparing the BSID and the BSID- 
II in 16 infants with Down syndrome and 17 
medically fragile infants, they found a complex in- 
teraction between test form, infant diagnosis, and 
time of testing. When tested at about 7 months of 
age with both forms and then retested at 22 months 
of age, average scores for both the BSID and the 
BSID-II declined for the Down syndrome group; 
however, for the medically fragile infants, the BSID 
scores declined, but the BSID-II scores increased. 
This complex pattern of results calls into question 
the comparability of the two forms and suggests 
reason for caution in using the BSID-II for predic- 
tive purposes in high-risk infants. A study with 
healthy Australian infants reported that BSID-II 
scores were appropriately lower than BSID scores, 
indicating that the norms for the first edition truly 
were outdated (Tasbihsazan, Nettelbeck, & Kirby, 
1997). In spite of these supportive studies, Nellis 
and Gridley (1994) suggest caution with the BSID- 
II until further research is available. 

The Bayley Scales require more skill to administer
and interpret than comparable instruments
such as the Denver-2. They also take longer to
administer (45 to 75 minutes). Consequently, the
Bayley Scales are reserved for special assessments
and research applications; they are not commonly
used as routine screening instruments.


In Brief: Additional Measures of Infant Ability 


The assessment of infants is both important and
difficult. Infants do not ordinarily follow directions,
and they may not be able to verbalize what they
know. Nonetheless, dozens of test developers
have risen to this extraordinary challenge. Even a brief
review of alternative instruments would be chapter-length.
We provide a quick summary of better-known
approaches in Table 5.4. Most of these
instruments involve observation or the presentation
of simple tasks to the examinee. For additional reviews
of infant assessment, the reader is encouraged
to read Nuttall, Romero, and Kalesnik (1992), Ricciuti
(1994), and Salvia and Ysseldyke (1991).


TABLE 5.4 Additional Measures of Infant Ability


Battelle Developmental Inventory (BDI) (Newborg, Stock, Wnek, Guidubaldi, &
Svinicki, 1984). Birth to age 8; the 341 items assess Personal-Social, Adaptive, Motor,
Communication, Cognitive, and Total domains. The full battery takes 1-2 hours to
administer; a screening version of the BDI (96 items) has been severely criticized.


Developmental Assessment of Young Children (DAYC) (Voress & Maddox, 1998).
Birth to age 6; assessment in five domains (cognition, communication, social-emotional,
physical, and adaptive) is completed through observation, interview of caregivers, and
direct assessment. DAYC provides a brief assessment (20 minutes) based upon outstanding
normative data (1,300 children divided into 23 age groups approximating the 1996
census). The resulting 5 indices and global index are highly reliable (coefficients ranging
from .90 to .99).


Developmental Indicators for the Assessment of Learning-3 (DIAL-3) (Mardell-
Czudnowski & Goldenberg, 1998). Ages 3 through 6; domains assessed include Motor
(e.g., catching, cutting, writing), Concepts (e.g., naming, counting, sorting), and Language
(e.g., nouns/verbs, problem solving, sentence length). The test-retest reliability
in the high .80s is extraordinary for an instrument of this type. English and Spanish
versions are available in the same kit.


Early Screening Inventory-Revised (ESI-R) (Meisels, Marsden, Wiske, & Henderson,
1997). Ages 3 to 6; a brief screening instrument that comes in two forms, the Preschool
version (ESI-P), and the Kindergarten version (ESI-K). Three areas of development are
sampled: Visual-Motor/Adaptive, Language and Cognition, and Gross Motor. The total
score is used to classify children into one of three referral groups: “OK” (above average
to minus 1 SD), “rescreen” (between minus 1 and minus 2 SDs), and “refer” (below
minus 2 SDs).


Early Screening Profiles (ESP) (Harrison, Kaufman, Kaufman, and others, 1990). Ages
2 through 6; domains assessed include Cognitive/Language, Motor, and Self-Help/Social;
four Surveys (Articulation, Behavior, Health History, and Home) supplement the assessment.
This instrument has strong psychometric qualities; the manual reports detailed
information on seven validation studies completed independently of the standardization
study. Barnett (1995) offers a skeptical review; Telzrow (1995) is more positive.


Gesell Child Development Age Scale (GCDAS) (Cassel, 1990). 18 months through 10
years; this test attempts to operationalize Gesell’s stage theory of child development by
asking a mother, teacher, or clinician to respond true-false to 100 age-appropriate items
from a larger set of 240 total items. Up to three raters can be used to evaluate a child;
results include a graphic printout of chronological versus developmental age across 10
developmental areas. The GCDAS is a promising test that needs further research as to
psychometric qualities (Lang, 1995).


ASSESSMENT OF
PRESCHOOL INTELLIGENCE


Preschool children exhibit wide variability in emotional
maturity and responsiveness to adults. One
child may warm up to the examiner and strive for
optimal performance on all questions. Another
child may stare mutely at the floor rather than at- 
tempt a simple block design task. For the first child, 
we can rest assured that the test results are an ap- 
propriate index of cognitive functioning. But for 
the second child, uncertainty prevails. Does the 
nonresponsiveness signal a lack of skill or a lack of 
cooperation? With preschool children, a large mea- 
sure of humility is required of the examiner. Scarr 
(1981) has expressed this sentiment as follows: 


Whenever one measures a child’s cognitive func- 
tioning, one is also measuring cooperation, atten- 
tion, persistence, ability to sit still, and social 
responsiveness to an assessment situation. 


The special danger in preschool assessment is that 
the examiner may infer that a low score indicates 
low cognitive functioning when, in truth, the child 
is merely unable to sit still, attend, cooperate, and 
so forth. Preschool assessment needs to be ap- 
proached with unusual caution to avoid negative 
consequences of labeling and overdiagnosis of dis- 
abling conditions. 

There are several individually administered in- 
telligence tests suitable for preschool children. 
Schakel (1986) has dubbed the following tests as 
“the big 4”: 


• Wechsler Preschool and Primary Scale of
Intelligence (WPPSI-R)

• Stanford-Binet: Fourth Edition (SB:FE)

• Kaufman Assessment Battery for Children
(K-ABC)

• McCarthy Scales of Children’s Abilities (MSCA)


These are the most commonly used intelligence 
tests for preschool children. The last of the four is 
rapidly approaching obsolescence (it was published 
in 1972). Unless it is revised, school psychologists 
will soon speak of “the big 3.” Of course, some of 
these instruments extend beyond the preschool age 
range into early childhood. The SB:FE is used for 
adults as well. The fifth edition of the Stanford- 
Binet (SB5), released in 2003, is described in the 
next chapter. We review these tests and an addi- 
tional, promising test: 


• Differential Ability Scales


The Wechsler Preschool and Primary Scale
of Intelligence-Revised (WPPSI-R)


The WPPSI-R is very similar to its predecessor, but 
offers updated norms and application to a wider age 
range—ages 3 years to 7 years and 3 months (Wech- 
sler, 1989). In addition, several dated and biased 
items were revised, and a version of Object Assem- 
bly was added to the original 11 subtests. Reliabil- 
ity and validity data for the WPPSI-R are very 
similar to those for the earlier version of this test. Salvia and
Ysseldyke (1991) summarize split-half reliabilities 
from several sources as follows: Verbal IQ (.86 to 
.96), Performance IQ (.85 to .93), and Full Scale IQ 
(.90 to .97). Reliabilities for the WPPSI-R subtests 
are substantially weaker; interpretations of the 
WPPSI-R should be restricted to the composite IQ 
scores. Norms for the WPPSI-R are based upon a 
carefully stratified sample of 1,700 children. The 
sample was stratified on age, sex, geographic region, 
ethnicity, and parental education and occupation. 
The validity of the WPPSI-R is predicated, in part, 
upon its resemblance to the WPPSI, which earned 
high praise from reviewers. Sattler (1988) reviewed 
several dozen studies supporting the concurrent and 
predictive validity of the WPPSI and concluded that 
the test serves as an excellent long-term predictor of 
intelligence and school performance in adolescence. 

Initial research with the WPPSI-R confirms 
the predictive validity of this instrument for 
later school performance. For example, Kaplan 
(1996) determined that preschool WPPSI-R results 
strongly predict elementary school achievement 
scores for children in kindergarten through third 
grade. Results for the third graders are summarized 
in Table 5.5 and reveal that Verbal and Full Scale 
IQs were much more powerful predictors of later 
achievement than Performance IQ. 

The WPPSI-R subtests include the following: 


Verbal             Performance
Information        Object Assembly
Comprehension      Geometric Design
Arithmetic         Block Design
Vocabulary         Mazes
Similarities       Picture Completion
Sentences          Animal Pegs
TABLE 5.5 Correlations between Preschool
WPPSI-R IQs and Later Achievement Test
Results for 72 Third Graders


Comprehensive Testing            WPPSI-R Scores
Program-III Scores            VIQ     PIQ     FSIQ
Verbal Ability                .62*    .24     .52*
Auditory Comprehension        .46*    .02     .30
Reading Comprehension         .48*   -.03     .28
Writing Mechanics             .54*    .16     .42
Writing Process               .45*    .08     .32
Quantitative Ability          .44     .17     .36
Mathematics                   .62*    .34*    .58*


*p < .005


Source: Reprinted with permission from Kaplan, C. (1996). Pre- 
dictive validity of the WPPSI-R: A four year follow-up study. Psy- 
chology in the Schools, 33, 211-219. Copyright © 1996 John Wiley 
& Sons, Inc. Reprinted by permission of John Wiley & Sons, Inc. 
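
One way to read the correlations in Table 5.5 is to square them, which gives the proportion of variance in the later achievement score that is shared with the preschool IQ. The short sketch below does this for the Mathematics row only. The helper name `shared_variance` is hypothetical and introduced purely for illustration, and the FSIQ entry is read as .58, which is an assumption about the printed value.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures correlated at r (r squared)."""
    return r ** 2


# Mathematics row of Table 5.5 (FSIQ read as .58, an assumption).
mathematics = {"VIQ": 0.62, "PIQ": 0.34, "FSIQ": 0.58}

for scale, r in mathematics.items():
    # Report each correlation alongside its squared value.
    print(f"{scale}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

Under these assumptions, Verbal IQ shares roughly 38 percent of its variance with third-grade Mathematics scores, compared with roughly 12 percent for Performance IQ, which is consistent with the conclusion that Verbal and Full Scale IQs were the stronger predictors.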


Three of these twelve subtests (Sentences, Geo- 
metric Design, and Animal Pegs) are found only on 
the WPPSI-R and are briefly summarized here.
(The other nine subtests, common to all the Wech- 
sler scales, are discussed in Topic 6A: Individual 
Tests of Intelligence.) Sentences is a supplementary 
subtest on the WPPSI-R. This subtest requires the 
child to repeat verbatim a sentence that has been 
read out loud by the examiner. The easiest item is 
on a par with “John had a green car,” while the most 
difficult item is much longer and consists of two 
connected sentences like these: 


“This Friday we will visit the farmer’s garden. 
Bring a quarter so you can buy a pumpkin.” 


The Geometric Design subtest consists of 10 de- 
signs—including a circle, a square, and a dia- 
mond—that the child is asked to copy. This subtest 
is a measure of perceptual and visual-motor orga- 
nization abilities. Finally, the Animal Pegs subtest 
requires the child to place a cylinder of the desig- 
nated color (black, white, blue, yellow) in a hole 
underneath the appropriate animal (dog, chicken, 
fish, and cat, respectively). There are 25 animals 
randomly sequenced in a 5 × 5 array. The initial
score on this subtest is the amount of time needed
to place a cylinder underneath each animal. Errors 
detract from the overall score. Success on Animal 
Pegs requires learning ability, manual dexterity, 
and sustained attention for the several minutes that 
might be needed to place an appropriate cylinder 
under each of the 25 animals. 

The extension of age coverage downward to 
age 3 is a welcome addition to the WPPSI-R, inso- 
far as the early identification of developmental dif- 
ficulties is essential to their remediation. Also, the 
IQ norms for the WPPSI-R extend downward to a 
score of 41, which is about 3.9 standard deviations 
below the population mean. Especially when used 
in conjunction with an assessment of adaptive be- 
havior, the WPPSI-R is an essential tool in the di- 
agnosis of mild to severe mental deficiency in 
preschool and early school-aged children. The IQ 
norms for this test also extend well beyond the 
range necessary for identification of giftedness in 
most school settings. These features have made the 
WPPSI-R very popular with school psychologists 
and early development specialists. 
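
The figure of 3.9 standard deviations quoted above can be checked with simple arithmetic. The sketch below assumes the usual Wechsler scaling of mean 100 and standard deviation 15, which is not restated in this passage; the function name `z_score` is hypothetical and used only for illustration.

```python
def z_score(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Distance of a score from the mean, in standard deviation units.

    The defaults (mean 100, SD 15) are the assumed Wechsler IQ scaling.
    """
    return (score - mean) / sd


# Lowest IQ covered by the WPPSI-R norms, as stated in the text.
print(round(z_score(41), 2))
```

Under that assumption, a score of 41 lies 59 points, or about 3.9 standard deviations, below the mean.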


Stanford-Binet: Fourth Edition 


With an age range of 2 years through adulthood, the 
Stanford-Binet: Fourth Edition (SB:FE) is one of 
those rare tests designed for use with preschoolers, 
children, and adults alike (Thorndike, Hagen, & 
Sattler, 1986). We present a detailed discussion of 
SB:FE psychometric properties in the next chapter 
on individual and group tests of intelligence. A few 
comments here will briefly summarize its value in 
preschool assessment. 

The SB:FE consists of 15 subtests, but not all 
subtests are administered to each age group. The 
subtests and those typically administered to pre- 
school children (up to age 5) are listed in Table 5.6. 
The reader will notice that the SB:FE yields a num- 
ber of subtest scores, four area scores, and an over- 
all composite score, which is no longer called an 
IQ. Unfortunately, with preschoolers the four con- 
tent areas are not sampled to the same depth. The 
Verbal Reasoning area is well represented by three 
subtests, but the Quantitative score is based upon a
single subtest. The other two areas (Abstract/Visual
Reasoning and Short-Term Memory) are based
upon results from only two subtests. As discussed
in the next chapter, Sattler (1988) advocates a two-factor
solution to the reporting of SB:FE scores
(Verbal Comprehension and Nonverbal Reasoning/
Visualization), which is certainly the preferred approach
with preschoolers.


TABLE 5.6 Subtests and Areas of the Stanford-
Binet: Fourth Edition


Verbal Reasoning
    Vocabulary*
    Comprehension*
    Absurdities*
    Verbal Relations

Abstract/Visual Reasoning
    Pattern Analysis*
    Copying*
    Matrices
    Paper Folding and Cutting

Quantitative Reasoning
    Quantitative*
    Number Series
    Equation Building

Short-Term Memory
    Bead Memory*
    Memory for Sentences*
    Memory for Digits
    Memory for Objects


*Denotes a subtest commonly used with preschool children.

An essential feature of the SB:FE is that the 
overall composite score is highly comparable with 
other mainstays of preschool assessment such as the 
WPPSI-R and the WISC-III. For example, the 
WPPSI-R manual reports similar global scores for 
115 children, four to seven years of age, tested with 
both instruments: average WPPSI-R IQ of 105.3 
versus SB:FE composite score of 107.2 (Wechsler, 
1989). In a study of 30 preschool children, the av- 
erage WPPSI-R IQ was 94.1, while the SB:FE com- 
posite score was 95.8 (McCrowell & Nagle, 1994). 
However, the verbal components of the two tests 
differed significantly: 95.5 for VIQ on the WPPSI-R
and 101.6 for Verbal Reasoning on the SB:FE.
Rust and Lindstrom (1996) found comparable 
scores on the SB:FE and the WISC-III for 57 vol- 
unteers (ages 6 to 17 years), with overall scores dif- 
fering less than 2 points, on average. Lavin (1996) 
also reported nearly identical overall scores on these 
two instruments for 40 children ages 6 to 16 years. 
However, in testing children with known develop- 
mental problems, Lukens and Hurrell (1996) found 
that WISC-III scores were lower in 29 of 31 cases, 
indicating that the SB:FE may underdiagnose men- 
tal retardation. A thorough discussion of the SB:FE 
in the context of preschool assessment can be found 
in Nuttall, Romero, and Kalesnik (1992). A review 
of validity research with the SB:FE is provided by 
Laurent, Swerdlik, and Ryburn (1992). 


Kaufman Assessment Battery 
for Children (K-ABC) 


The K-ABC is a combined measure of intelligence 
and achievement that was constructed loosely 
within the theoretical framework of modern neuro- 
psychology (Luria, 1966; Das, Kirby, & Jarman, 
1979). Many of the K-ABC subtests resemble 
neuropsychological tests, which we discuss in 
more detail in a later topic. Even so, the K-ABC is 
oriented primarily toward psychoeducational as- 
sessment and educational planning (Kaufman & 
Kaufman, 1983). Proponents of the K-ABC claim 
that it possesses greater relevance to psychoeduca- 
tional planning than traditional tests such as the 
Wechsler scales and the Stanford-Binet. 

Designed for examinees ages 2½ to 12½, the
K-ABC consists of 16 subtests, with no more than 
13 administered to any one child (Figure 5.5). Ten 
of the subtests yield the Mental Processing Com- 
posite, which is normed to generate the familiar av- 
erage of 100 and standard deviation of 15. The 
other six subtests make up the Achievement Scale. 
The ten mental processing subtests are broken 
down into two global scales: the Simultaneous
Processing Scale (7 subtests) and the Sequential
Processing Scale (3 subtests).

One goal of the K-ABC is to yield scores that
translate to educational intervention. Based loosely
upon neuropsychological concepts, the Sequential
and Simultaneous scales are hypothesized to reflect
the child’s style of problem solving and information
processing. The Sequential Processing subtests require
serial or temporal arrangement of verbal, numerical,
or visuoperceptual content. Children who
score high on this scale (sequential learners) are
presumed to learn best by encountering small
amounts of information in consecutive, step-by-step
order, such as a series of clear-cut verbal instructions.
In contrast, the Simultaneous Processing subtests
require the child to synthesize and organize visuoperceptual
or spatial content in an immediate or
holistic fashion. Children who score high on this
scale (simultaneous learners) are presumed to
learn best by integrating and synthesizing many related
pieces of information at the same time, such
as found in visual media (pictures, maps, or charts).


FIGURE 5.5 Subtests and Scales on the Kaufman
Assessment Battery for Children

Mental Processing Composite
    Sequential Processing Scale
        Hand Movements*
        Number Recall
        Word Order
    Simultaneous Processing Scale
        Magic Window
        Face Recognition*
        Gestalt Closure
        Triangles*
        Matrix Analogies*
        Spatial Memory*
        Photo Series*

Achievement Scale
    Expressive Vocabulary
    Faces and Places
    Arithmetic
    Riddles
    Reading/Decoding
    Reading/Understanding

*Nonverbal scale.

Kaufman, Kaufman, and Goldsmith (1984) 
provide guidelines and examples for teaching read- 
ing, spelling, and arithmetic to children with K- 
ABC-based sequential or simultaneous processing 
strengths. Although the theory is compelling, sup- 
port for the hypothesized K-ABC aptitude-treat- 
ment interaction is mixed at best. For example, 
Fisher, Jenkins, Bancroft, and Kraft (1988) 
matched teaching strategies to sequential/simultaneous
cognitive styles (as determined from the
K-ABC) for 57 elementary schoolchildren enrolled in
a learning disability clinic. Although the results 
generally supported the predicted aptitude-treat- 
ment interaction, the effects were small and not of 
any practical significance. 

Both scales in the Mental Processing Compos- 
ite (MPC) were designed to reduce the effects of 
sex and race bias, and by most reports the test 
developers succeeded in these goals (Nolan, 
Watlington, & Willson, 1989). Based on the stan- 
dardization data, Kaufman, Kamphaus, and Kauf- 
man (1985) reported small differences (on the order 
of 5 points) between MPC scores obtained by white 
and minority group members on the K-ABC. This 
is a much smaller difference than typically found 
with tests such as the WISC-III or Stanford-Binet, 
for which differences on the order of 15 IQ points, 
favoring whites, are common. 

Valencia and Rankin (1988) tested 76 white and 
90 Mexican American fifth and sixth graders and re- 
ported almost no difference on the Mental Process- 
ing Composite (100 versus 98, respectively), although 
a large difference was found on the Achievement 
Scale (103 versus 91). Knight, Baker, and Minder
(1990) report a comparably small difference between the
K-ABC and the SB:FE for 30 African American
elementary schoolchildren with learning disabilities
(MPC of 83 versus SB:FE Composite of 84).

TABLE 5.7 A Description of K-ABC Subtests


Sequential Processing Scale

Hand Movements: The child must copy the precise sequence of taps on the table with the
fist, palm, or side of the hand as performed by the examiner.

Number Recall: Very similar to the traditional digit span test, except the examiner is
instructed not to drop his/her voice after saying the last digit.

Word Order: Measures the child’s ability to point to silhouettes of common objects in the
same order as these objects were named by the examiner.


Simultaneous Processing Scale

Magic Window: Requires the child to identify and name an object whose picture is rotated
behind a narrow slit so that only a fraction of the picture is exposed at any point in
time.

Face Recognition: The child must attend closely to one or two faces in a photograph
shown briefly and then select the correct face(s) in a group photograph.

Gestalt Closure: The child must name or accurately describe a partially completed
inkblot-like drawing. This subtest measures the ability to mentally fill in gaps to form a
gestalt.

Triangles: The child must assemble several identical rubber triangles (yellow on one
side, blue on the other) to match a picture of an abstract design.

Matrix Analogies: Using vinyl chips, the child must select the picture or design that best
completes a 2 × 2-inch matrix that expresses a visual analogy.

Spatial Memory: The child must recall the locations of pictures arranged randomly on a
page.

Photo Series: The child must order a randomly arranged array of photographs in their
proper time sequence. This subtest is similar to Picture Arrangement on the Wechsler
scales, except that the task must be solved without physical manipulation, thereby
eliminating an unwanted stress on visual-motor feedback.


Achievement Scale

Expressive Vocabulary: The child must name the object pictured in a photograph.

Faces and Places: The child must name a well-known person, fictional character, or
place depicted in a photograph.

Arithmetic: A test of basic computational skills and school-related arithmetic abilities.

Riddles: The child must infer the name of a concrete or abstract concept based on a list
of its characteristics.

Reading/Decoding: A test of letter identification and word recognition/pronunciation.

Reading/Understanding: The child must demonstrate reading comprehension by following
commands given in sentences.


The K-ABC scales and subtests are described
in Table 5.7. In addition to the Simultaneous,
Sequential, and Achievement scales, a supplemental
Nonverbal Scale can be computed with six subtests
(from the simultaneous and sequential groups)
that do not require words. The subtests of the Nonverbal
Scale include the following:


• Hand Movements
• Face Recognition
• Triangles
• Matrix Analogies
• Spatial Memory
• Photo Series


For the Nonverbal subtests, the examiner demonstrates
each task by example or pantomime. The
Nonverbal Scale is appropriate for the testing of recent
immigrants or bilingual children whose English
language skills might be weak. In addition,
children who have hearing impairment, speech disorders,
or language disorders can be tested fairly
with this scale.

The K-ABC was standardized on a stratified 
national sample of 2,000 children carefully selected
to represent the 1980 U.S. Census on sex, geo- 
graphic region, parental education, community 
size, and ethnic category (white, African American, 
Hispanic, other). An unusual and welcome feature 
of the standardization sample is the emphasis upon 
children in special education placements and 
programs for gifted and talented children (about 7 
percent of the norm sample). In addition, supple- 
mentary sociocultural norms for race and parental 
education were derived from test results on 496
African American and 119 white children. 

The reliability of the K-ABC is generally quite 
good, although some subtests possess marginally ac- 
ceptable internal consistency coefficients, especially 
for younger subjects. For preschool subjects, mean 
values ranged from .72 for Magic Window to .88 for 
Number Recall. For school-age children, mean val- 
ues ranged from .71 for Gestalt Closure to .85 for 
Matrix Analogies. On the other hand, reliability of 
the Scale scores and the composite score is very ro- 
bust. For example, the test-retest reliability of the 
Achievement Scale is .93 for preschool children and 
.97 for school-aged children; the Mental Processing 
Composite has a reliability of .90 for preschool chil- 
dren and .95 for school-aged children. 
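
To give a sense of what these coefficients imply for an individual score, the sketch below applies the standard error of measurement from classical test theory, SEM = SD × √(1 − r). That formula is not part of this section and is brought in here only as an illustration. The sketch uses the standard deviation of 15 and the Mental Processing Composite reliabilities quoted above, treats the reported coefficients as the reliability estimate (an assumption), and uses a hypothetical function name.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


# Mental Processing Composite: SD of 15, reliabilities as reported in the text.
for group, r in [("preschool", 0.90), ("school-aged", 0.95)]:
    sem = standard_error_of_measurement(15, r)
    print(f"{group}: reliability = {r:.2f}, SEM = {sem:.2f} points")
```

Under these assumptions the composite carries a standard error of roughly 4.7 points for preschool children and roughly 3.4 points for school-aged children, which illustrates why the composite is the safer basis for interpretation than the less reliable subtests.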

Validity studies of the K-ABC present a mixed 
picture, with good support for convergent/discrim- 
inant validity, strong confirmation of criterion-re- 
lated validity, and good support for age-appropriate 
changes in test scores (relevant to construct valid- 
ity). However, factor-analytic studies of construct 
validity show mixed and conflicting outcomes. 
Kamphaus, Beres, Kaufman, and Kaufman (1996) 
provide a thorough review of the literally dozens of 
validational studies of the K-ABC. 

The Simultaneous and Sequential scales show 
the expected pattern of correlations with simulta- 
neous and successive factors on other test batteries 
(e.g., Das & Mensink, 1989), indicating good convergent
and discriminant validities for these global
scales on the K-ABC. Cooley and Ayres (1985) 
found appropriately strong correlations between 
K-ABC scores and achievement measures (conver- 
gent validity) and appropriately negligible relations 
between K-ABC scores and measures of childhood 
anxiety (discriminant validity). A somewhat curi- 
ous finding also emerged: The K-ABC Mental 
Processing Composite correlated -.51 with the 
Hyperactivity Scale from the Achenbach Child 
Behavior Checklist (the correlation is negative be- 
cause a high score on Hyperactivity indicates dys- 
functional behavior). This well-validated checklist 
consists of items rated by the parents (e.g., child 
cannot concentrate, cannot pay attention for 
long, not liked by other children). Apparently, the 
K-ABC taps attentional capacities to some extent. 
Perhaps the overlap is indirect, with poor atten- 
tional abilities leading to reduced attainment of in- 
tellectual skills measured by the K-ABC. 

The validity of the K-ABC is also buttressed by 
the 43 studies cited and summarized in the Inter- 
pretive Manual, including numerous correlational 
studies with other tests (Kaufman & Kaufman, 
1983). These correlations vary widely but are gen- 
erally supportive of K-ABC validity, at least as 
regards the mental-processing composite. The find- 
ings of Obrzut, Obrzut, and Shaw (1984) and 
Naglieri (1985) are typical in this regard. Using 
independent samples of children with learning dis- 
abilities and educable students with mental retar- 
dation, these two studies reported nearly identical 
correlations of .80 and .83, respectively, between 
WISC-R IQ and K-ABC composite scores. 

As the reader will recall from the chapter on validity,
one way to demonstrate the construct validity
of a test is to show that age changes in raw
scores are orderly, sensible, and theory-consistent. 
Reynolds, Willson, and Chatman (1984) correlated 
age and raw scores for the standardization sample 
(N = 2,000) and an additional sample of African 
Americans and whites (N = 615). All correlations 
between age and raw scores were highly signifi- 
cant. More important, no significant differences oc- 
curred in the magnitude of these relationships as a 
function of race or sex grouping, which supports 
the construct validity of the K-ABC as a developmental
measure of aptitude and achievement.

Factor-analytic studies of the sequential and si- 
multaneous dimensions on which the K-ABC is 
based have produced conflicting results (Kamphaus, 
1990). On the positive side, the test developers cite 
findings from the standardization sample that seem 
to confirm the sequential/simultaneous distinction 
(Kaufman & Kaufman, 1983; Kaufman, Kaufman, 
Kamphaus, & Naglieri, 1982). Some independent 
studies have also reported similar findings (e.g., 
McCallum, Karnes, & Oehler-Stinnett, 1985). 

Critics acknowledge that the K-ABC is a good 
measure of general intelligence, but cast doubt 
on the distinction between simultaneous and se- 
quential processing as a basis for understanding 
test performance. For example, Strommen (1988) 
undertook a confirmatory factor analysis of the 
K-ABC specifically to test the hypothesis that the 
factors that make up the test are only moderately 
correlated, a key assertion of the test authors
(Kaufman & Kaufman, 1983). He concluded that
the factors that underlie the K-ABC are substantially
intercorrelated at all age levels, casting doubt on the 
separate existence of sequential and simultaneous 
processes on the test. These modern constructs may 
well turn out to be old wine in new bottles, nothing 
more than a relabeling of the familiar dichotomy be- 
tween highly intercorrelated forms of verbal and 
nonverbal reasoning. For a more positive interpre- 
tation of K-ABC factor-analytic studies, see the re- 
view article by Kamphaus et al. (1996). 

A second criticism surrounds the designation of
six subtests as achievement tests. Anastasi (1985)
points out that a test can be properly labeled an
achievement test only when it is closely tied to specific
instructional content. However, the authors of
the K-ABC made special efforts to dissociate the 
achievement tests from any particular curricular 
content. These subtests more closely resemble tra- 
ditional measures of intelligence than academic 
achievement. In fact, based on factor loadings on 
the unrotated first factor, many researchers have 
concluded that the Achievement subtests provide a 
better measure of general intelligence than do the 
mental processing subtests (Kline, Guilmette, Snyder,
& Castellanos, 1992). This controversy rests,
in part, upon differing philosophical positions as to
the nature of intelligence and is not likely to be re- 
solved by research (Reynolds, 1994b). 

Another concern is that the K-ABC does not tap 
verbal skills sufficiently (Sattler, 1988). In spite of 
these controversies, the K-ABC offers a unique and 
appealing approach to the assessment of children’s 
intelligence and possesses very high standards of 
technical quality. As long as examiners include the 
Achievement subtests—which tap general intelli- 
gence to a substantial degree—the K-ABC can 
provide a new and valuable approach to psycho- 
educational assessment. 


McCarthy Scales of Children’s Abilities 


The McCarthy Scales of Children’s Abilities 
(MSCA) is an individually administered intelligence 
test designed for children 2½ to 8½ years of age
(McCarthy, 1972). The test consists of 18 sepa- 
rate subtests, as listed in Table 5.8. The subtests con- 
tribute to scores on five scales, each scale derived 
from three to seven subtests: Verbal, Perceptual- 
Performance, Quantitative, Memory, and Motor. In 
addition, a General Cognitive Index with mean of 
100 and SD of 16 is computed from 15 of the sub- 
tests. The test is designed to provide a better under- 
standing of both normal children and those with 
learning disabilities. McCarthy (1972) emphasized 
functional considerations such as the desire to iden- 
tify clinically and educationally relevant cognitive 
weaknesses as the primary criteria for item selection 
and subtest groupings on the McCarthy Scales. 
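Because the General Cognitive Index uses a mean of 100 and an SD of 16, whereas many other instruments use an SD of 15, it can help to see how a score moves between the two metrics. The following is a minimal sketch, assuming normally distributed scores; the function names and the example score of 84 are hypothetical and do not come from the test manual.

```python
from statistics import NormalDist

def to_z(score: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """Convert a standard score to a z score (SD units from the mean)."""
    return (score - mean) / sd

def rescale(score: float, from_sd: float = 16.0, to_sd: float = 15.0,
            mean: float = 100.0) -> float:
    """Re-express a score from one SD metric on another with the same mean."""
    return mean + to_z(score, mean, from_sd) * to_sd

def percentile(score: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """Percentile rank under the assumed normal distribution."""
    return 100.0 * NormalDist(mean, sd).cdf(score)

gci = 84.0  # hypothetical General Cognitive Index
print(to_z(gci))                 # distance from the mean in SD units
print(rescale(gci))              # same standing on an SD-15 metric
print(round(percentile(gci), 1)) # approximate percentile rank
```

A score one SD below the mean occupies the same relative position on either metric; only the numerical label changes.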

The standardization sample of 1,032 children 
consisted of approximately 100 subjects at half-year 
increments from ages 2½ through 5½ and full-year
increments from 5½ through 8½. At each age level,
the sample was roughly stratified on the following 
variables in accordance with the 1970 U.S. Census: 
sex, race (white-nonwhite), geographic region, fa- 
ther’s occupational level, and urban-rural residence. 
Children with severe mental or emotional problems 
were excluded, and bilingual subjects were included 
only if they could speak and understand English. Of 
course, a potential problem with the McCarthy 
Scales is that the normative data, collected in the 
early 1970s, are quite dated.


TABLE 5.8 Subtests and Scales of the McCarthy Scales of Children’s Abilities

                                             Scales
                                Perceptual-                               General
Subtests                Verbal  Performance  Quantitative  Memory  Motor  Cognitive Index

Pictorial Memory        V                                  Mem            GCI
Word Knowledge          V                                                 GCI
Verbal Memory           V                                  Mem            GCI
Verbal Fluency          V                                                 GCI
Opposite Analogies      V                                                 GCI
Block Building                  P                                         GCI
Puzzle Solving                  P                                         GCI
Tapping Sequence                P                          Mem            GCI
Right-Left Orientation          P                                         GCI
Draw-A-Design                   P                                  Mot    GCI
Draw-A-Child                    P                                  Mot    GCI
Conceptual Grouping             P                                         GCI
Number Questions                             Q                            GCI
Numerical Memory                             Q             Mem            GCI
Counting and Sorting                         Q                            GCI
Leg Coordination                                                   Mot
Arm Coordination                                                   Mot
Imitative Action                                                   Mot





Reliability findings for the McCarthy Scales 
present a mixed picture. The General Cognitive 
Index performs well, with split-half reliabilities 
averaging about .93 and one-month test-retest coef- 
ficients averaging about .90. Split-half reliabilities 
for the five scales range from .79 to .88, while test- 
retest coefficients range from .69 to .89. Reliabilities 
for the 18 individual subtests are substantially lower, 
so examiners are cautioned not to place too much 
emphasis upon subtest patterns and differences.

Unfortunately, the clinically based derivation of 
the five McCarthy scales has not been confirmed 
by factor-analytic studies, casting doubt on the con- 
struct validity of this instrument. Although five fac- 
tors (corresponding to the five scales) were found 
at most age groups in the standardization sample 
(Kaufman, 1975), later studies have not replicated 
the original findings. For example, Forns-Santacana
and Gomez-Benito (1990) found five factors
in a sample of 141 4- and 5-year-olds, but these factors
did not correspond to the breakdown proposed
by McCarthy. Other researchers report similar in- 
stances of failure to confirm the original distribu- 
tion of the subtests. For example, Keith and Bolen 
(1980) found only three factors in a sample of 300 
children ages 6 to 8½: general cognitive, verbal,
and motor. 

The confusion about the factorial structure of 
the McCarthy Scales indicates that examiners 
should be wary of profile analysis that relies upon 
the five scales previously listed (Verbal, Perceptual- 
Performance, Quantitative, Memory, and Motor). In 
many samples and for some age groups, the scales 
may be better measures of general cognitive ability 
than of the specific abilities designated by the names 
of the scales (Sattler, 1988). 

On the positive side, the McCarthy Scales function
very well as a predictor of school readiness and
later scholastic achievement for kindergarten children.
Massoth and Levenson (1982) tested 33
children with the MSCA in the fall term of kinder- 
garten and correlated these scores with results of a 
reading-readiness test administered one year later 
and also with achievement levels upon completion 
of first grade. Curiously, the strongest correlations 
were with the Quantitative Scale, whereas the 
Verbal Scale fared poorly as a predictor of school 
readiness or reading achievement (Table 5.9). The 
perceptual and analytical abilities measured by the 
McCarthy Scales appear to be better predictors of 
reading readiness and achievement than are the ver- 
bal tasks. A six-year follow-up of the same subjects 
revealed surprisingly high correlations between 
McCarthy Scales administered in kindergarten and 
scholastic achievement in the sixth grade (Massoth, 
1985). The Quantitative Scale showed the strongest 
correlation with course grades (r = .60), whereas the 
Verbal Scale was a weak predictor (r = .40). The 
Quantitative Scale would appear to be an excellent 
screening test for preschool children. 
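With only 33 children, a correlation must be fairly large before it reaches statistical significance, which is what the asterisks in Table 5.9 convey. The sketch below shows one conventional way to check this, using the t test for a Pearson correlation with N - 2 degrees of freedom. The critical values are copied from a standard two-tailed t table for 31 degrees of freedom and are an assumption of this illustration; the function names are hypothetical, and the original study may have used a different procedure.

```python
import math

N_CHILDREN = 33  # sample size reported for Table 5.9

# Two-tailed critical t values for df = N - 2 = 31, taken from a standard
# t table (an assumption of this sketch, not a value from the source study).
T_CRITICAL = {0.05: 2.040, 0.01: 2.744}

def t_statistic(r: float, n: int = N_CHILDREN) -> float:
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def significance_stars(r: float) -> str:
    """Return '**' (p < .01), '*' (p < .05), or '' for N = 33."""
    t = abs(t_statistic(r))
    if t >= T_CRITICAL[0.01]:
        return "**"
    if t >= T_CRITICAL[0.05]:
        return "*"
    return ""

for r in (0.33, 0.39, 0.64):
    print(f"r = {r:.2f}  t = {t_statistic(r):.2f}  {significance_stars(r)}")
```

Under these assumptions a correlation of .33 falls just short of the .05 level, whereas .39 reaches it and .64 exceeds the .01 level.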

In sum, the McCarthy Scales provide a valuable
and predictive index of intellectual functioning,
especially for children in the 5- to 6-year-old range.
The instrument is also an excellent tool for assessing
general intelligence, although it may underestimate
functioning in preschoolers and children with learning
disabilities and mental retardation. The norms for
the test are now badly outdated. In spite of this, overall
scores on the MSCA correspond very closely to
Full Scale IQs on the WPPSI-R (Karr, Carvajal,
Elser, & Bays, 1993). Even so, the McCarthy Scales
need to be revised and restandardized.


TABLE 5.9 Correlations between McCarthy
Scale Scores and Reading Readiness and
Achievement for 33 Kindergarten Children

                          Macmillan Reading   Metropolitan
McCarthy Scale            Readiness           Achievement Test

Verbal                    .33                 .16
Perceptual-Performance    .39*                [illegible]
Quantitative              .64**               .50
General Cognitive         [illegible]         .39*
Memory                    .39*                .28
Motor                     .31                 .35*

*p < .05.
**p < .01.

Source: Adapted with permission from Massoth, N. A., & Levenson,
R. L. (1982). The McCarthy Scales of Children’s Abilities as
a predictor of reading readiness and reading achievement. Psychology
in the Schools, 19, 293-296.


Differential Ability Scales 


The Differential Ability Scales (DAS) are a recent 
addition to individual intelligence testing that 
deserve brief mention (Elliott, 1990, 1997). The 
DAS covers an age range of 2½ years to 18 years
in three overlapping batteries: lower preschool 
(ages 2:6-3:5), upper preschool (ages 3:6-5:11), 
and school-age (ages 6:0-17:11). We present the 
upper preschool battery here. 
The subtests of the preschool battery consist of
“core” and “diagnostic” subtests. The core subtests 
are heavily saturated with the g factor and are used 
to derive two area scores (Verbal and Nonverbal) and 
an overall composite score known as General Con- 
ceptual Ability (GCA). The area scores and the GCA 
are based upon a mean of 100 and standard deviation 
of 15. The diagnostic subtests measure short-term 
memory and speed of information processing. They 
are used for clinical analysis only. The diagnostic 
subtests are less dependent upon the g factor and 
therefore do not figure in the overall composite. The 
subtests of the DAS are described in Table 5.10. 
The reliability of the DAS scores is commend- 
able for an instrument used at the preschool level. 
For preschoolers, GCA reliability is reported to be 
.90 to .94. For older preschoolers (3½ to 6 years of
age) the Verbal and Nonverbal cluster scores show 
reliabilities of .88 and .89, respectively. Concurrent 
validity studies are highly supportive of the DAS, 
with correlations in the .70s and .80s with other 
preschool measures of intelligence and achievement 
(Elliott, 1990ab). A study by Dumont, Cruse, Price, 
and Whelley (1996) provides very strong support 
for DAS validity by providing a confirmatory pat- 
tern of correlations between this instrument and the 
WISC-III for 53 children identified as having learn- 
ing disabilities. The results are summarized in Table 
5.11 and show that similar components correlate
more strongly than dissimilar components for the
two tests. Also, the table reveals that overall scores
are very similar, on average, for the DAS (mean
GCA of 87.2) and the WISC-III (mean IQ of 89.7).
DiCerbo and Barona (2000), Elliott (1997), and
Gordon and Elliott (2001) describe additional validity
studies for this fine instrument.


TABLE 5.10 Subtests of the DAS Preschool Battery

                                                                                              Contribution
Subtest                     Abilities Measured                                                to Composite

Core Subtests
Verbal Comprehension        Receptive language, understanding oral instructions               Verbal, GCA
Naming Vocabulary           Expressive language, knowledge of names                           Verbal, GCA
Picture Similarities        Nonverbal reasoning, matching pictures with common themes         Nonverbal, GCA
Pattern Construction        Nonverbal, spatial visualization with colored blocks and squares  Nonverbal, GCA
Copying                     Design copying, fine-motor coordination, visual-spatial matching  Nonverbal, GCA
Early Number Concepts       Knowledge of number and quantitative concepts                     GCA

Diagnostic Subtests
Block Building              Spatial orientation, visual-perceptual matching with blocks       n/a
Matching Letterlike Forms   Spatial relationships, visual discrimination of similar forms     n/a
Recall of Digits            Short-term auditory memory for sequences of numbers               n/a
Recall of Objects           Short-term learning and verbal recall of pictures                 n/a
Recognition of Pictures     Short-term visual memory, recognition of familiar objects         n/a

n/a = not applicable.


TABLE 5.11 Correlations between DAS and WISC-III Composites
for 53 Children with Learning Disabilities

                             WISC-III Composite              DAS
DAS Composite           VIQ     PIQ            FSIQ     Mean    SD

Verbal                  .77     .52            .72      90.2    12.0
Nonverbal Reasoning     .55     .65            .67      83.5    12.5
Spatial                 .50     .67            .64      93.6    17.0
GCA                     .68     [illegible]    .78      87.2    14.8

WISC-III Mean           89.4    93.2           89.7
WISC-III SD             13.8    14.2           13.2

Source: Reprinted with permission from Dumont, R., Cruse, C., Price, L., & Whelley, P. (1996). The relationship
between the Differential Ability Scales (DAS) and the Wechsler Intelligence Scale for Children-Third
Edition (WISC-III) for students with learning disabilities. Psychology in the Schools, 33, 203-209.
Copyright © John Wiley & Sons, Inc. Reprinted by permission of John Wiley & Sons, Inc.


PRACTICAL UTILITY OF INFANT
AND PRESCHOOL ASSESSMENT


The history of child assessment has shown time and
again that, in general, test scores earned in the first
year or two of life show minimal predictive validity.
For example, in her review of infant intelligence
testing, Goodman (1990) concludes:


If the successful prediction of adolescent and 
adult intelligence from early childhood scores is 
one of the great accomplishments of applied psy- 
chology, then the failure to predict intelligence 
from infancy to early childhood ranks as one of its 
greatest failures. 


Given this dismal record of repeated failures of pre- 
dictive validity, we must ask a difficult question: 
What is the purpose and practical utility of infant 
assessment? In fact, infant tests do have an impor- 
tant but limited role to play. We return to that issue 
after a review of predictive studies. 


Predictive Validity of Infant 
and Preschool Tests 


With heterogeneous samples of normal children, 
the general finding is that infant test scores corre- 
late positively but unimpressively with childhood 
test scores (Goodman, 1990; McCall, 1979). A few 
studies are more optimistic in tone (e.g., Wilson, 
1983), but most researchers agree with McCall’s 
(1976) conclusion: 


Generally speaking, there is essentially no correla- 
tion between performance during the first six 
months of life with IQ score after age 5; the corre- 
lations are predominantly in the 0.20s for assess- 
ments made between 7 and 18 months of life when 
one is predicting IQ at 5-18 years; and it is not
until 19-30 months that the infant test predicts later
IQ in the range of 0.40-0.55.


McCall (1979) reconfirmed his original conclusion 
in a later review, which we have summarized here. 
The reader will notice in Table 5.12 that the corre- 
lations between infant and school-age test scores 
do not exceed .40 until the subjects are at least 19 
months of age for the initial testing. 
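One way to appreciate how weak these relationships are is to square the correlation, which gives the proportion of variance in later IQ that is statistically shared with the infant score. The sketch below applies that standard interpretation to the round figures quoted from McCall (1976); the function name is hypothetical, and the calculation is an illustration rather than an analysis reported in the source.

```python
def shared_variance_percent(r: float) -> float:
    """Percentage of variance shared by two measures correlated at r."""
    return 100.0 * r * r

# Round values taken from the ranges in the quotation above.
quoted = [
    ("7-18 months, r in the .20s", 0.20),
    ("19-30 months, lower bound", 0.40),
    ("19-30 months, upper bound", 0.55),
]

for label, r in quoted:
    print(f"{label}: r = {r:.2f}, "
          f"shared variance = {shared_variance_percent(r):.1f}%")
```

Even at the upper end of the quoted range, well under half of the variance in later IQ is accounted for by the infant score.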

The findings with preschool tests are somewhat 
more positive in tone. The correlation between 
preschool test results and later IQ is typically 
strong, significant, and meaningful. The simplest 
way to investigate this question is to measure the 
stability of IQ results in longitudinal studies. In 
Table 5.13, we have summarized the age-to-age sta- 
bility of children’s IQ scores on the Stanford-Binet 
from the Fels Longitudinal Study, an early, classic 
follow-up investigation of children’s intellectual 
and emotional development (Sontag, Baker, & Nel- 
son, 1958). The lowest correlation in this table is 
.43, and that is between IQ tested at age 4 and again 
at age 12. What stands out in the table is the ro- 
bustness of the link between IQ in preschool and 
later childhood. The older the child at initial test- 
ing, the stronger the relationship with later IQ. In 
fact, the results suggest that IQ becomes reasonably 
stable, on average, by 8 years of age. 


TABLE 5.12 Summary of Correlations between
Infant and Childhood Intelligence Test Scores in
Normal Subjects

Age of Initial            Age of Childhood Test (Years)
Infant Test (Months)      3-4       5-7       8-18

1-6                       .21       .09       .06
7-12                      .32       .20       .25
13-18                     .50       .34       .32
19-30                     .59       .39       .49





Source: Adapted by permission from McCall, R. B. (1979). The 
development of intellectual functioning in infancy and the predic- 
tion of later IQ. In J. D. Osofsky (Ed.), Handbook of infant devel- 
opment. New York: Wiley. Copyright © 1979 John Wiley & Sons, 
Inc. Reprinted by permission of John Wiley & Sons, Inc.


TABLE 5.13 Stability of IQ from 3 to 12 Years of Age

Age at                                         Age at Retesting
Initial Testing   4      5            6            7      8      9      10           11     12

3                 .83    [illegible]  [illegible]  .64    .60    .63    .54          .51    .46
4                        .80          .85          .70    .63    .66    [illegible]  .50    .43
5                                     .87          .83    .79    .80    .70          .63    .62
6                                                  .83    .49    .81    [illegible]  .67    .67
7                                                         .91    .83    .82          .76    [illegible]
8                                                                .92    .90          .84    .83
9                                                                       .90          .82    .81
10                                                                                   .90    .88
11                                                                                          .90




Source: Adapted with permission from Sontag, L. W., Baker, C., & Nelson, V. (1958). Mental growth 
and personality development: A longitudinal study. Monographs of the Society for Research in Child 
Development, 23 (Whole No. 68). Copyright © by The Society for Research in Child Development, Inc. 


Collectively, these findings confirm that in- 
fant tests generally have poor prognostic value, 
whereas preschool tests are moderately predictive of 
later intelligence. This brings us back to the question 
posed at the beginning of this section: What is the 
purpose and practical utility of infant assessment? 


Practical Utility of the Bayley-II
and Other Infant Scales 


The most important and justifiable use of infant 
tests is in screening for developmental disabilities. 
Although existing infant tests are poor predictors 
of childhood intelligence, an exception to this rule 
is encountered for infants who obtain a very low 
score on the Bayley-II or other screening devices. 
For example, infants who score two standard devi- 
ations below the mean on the Bayley, particularly 
on the Mental Scale, have a high probability of test- 
ing in the ranges of those with mental retardation 
later in life (Self & Horowitz, 1979; Goodman, 
Malizia, Durieux-Smith, MacMurray, & Bernard, 
1990). 
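The screening rule described here, a score two standard deviations below the mean, translates into a concrete cutoff once a score metric is fixed. The sketch below assumes an index with a mean of 100 and an SD of 15 and a normal distribution of scores; those norming values are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

def cutoff_score(mean: float, sd: float, z: float = -2.0) -> float:
    """Score lying z standard deviations from the mean."""
    return mean + z * sd

def proportion_below(score: float, mean: float, sd: float) -> float:
    """Expected proportion of a normal population scoring below `score`."""
    return NormalDist(mean, sd).cdf(score)

MEAN, SD = 100.0, 15.0  # assumed metric for the illustration

cut = cutoff_score(MEAN, SD)
share = proportion_below(cut, MEAN, SD)
print(f"cutoff = {cut:.0f}")
print(f"expected share of infants below cutoff = {100 * share:.1f}%")
```

Under these assumptions, only a small fraction of infants would be flagged, which fits the use of such scores for screening rather than for ranking infants in the normal range.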

With at-risk children, the correlation between 
infant test scores and later childhood IQ is much 
stronger than for samples of normal children. McCall
(1983) determined that the median correlation
between infant test scores and childhood IQ at
seven-year follow-up was a healthy .48. The most
consistent finding is that a very low score on an in- 
fant test—two standard deviations below the mean 
and lower—accurately prognosticates low IQ in 
childhood (Frankenburg, 1985). For example, studies
with the Denver Developmental Screening Test-Revised
(since revised and published as the
Denver-II) revealed a false positive rate of only 5 to
11 percent, meaning that infants and preschoolers 
identified as at-risk rarely achieve normal range 
functioning. Studies with the Bayley Scales also 
conform to this pattern (e.g., VanderVeer & 
Schweid, 1974). 


New Approaches to Infant Assessment 


Lewis has argued that traditional infant tests over- 
look early information-processing behaviors, such 
as recognition memory and attentiveness to the en- 
vironment, that might better predict childhood cog- 
nitive function (Lewis & Sullivan, 1985). In one 
study, simple visual habituation to a novel stimulus 
(measured by the duration of fixation) assessed at 
three months of age correlated .61 with the Bayley
Mental score at 24 months of age (Lewis & Brooks-Gunn,
1981). Using a similar paradigm, Fagan has
reported comparable findings (Fagan, 1984; Fagan 
& Shepherd, 1986). For example, in one study he 
tested infant recognition memory at four to seven 
months with the habituation method (Fagan & McGrath,
1981). In this approach, the infants first observed
a picture of a baby’s face for a short period
of time and were then shown the same picture along- 
side an unfamiliar picture (e.g., picture of a bald- 
headed man). The investigators kept careful track of 
which picture the infants looked at most. The logic 
of the procedure is simple: Staring mainly at the new 
picture signifies that an infant recognizes the old 
picture; that is, an infant with good recognition 
memory prefers to look at something new. Prefer- 
ence for novelty—as measured by visual fixation 
time on the new picture—thus becomes an index of 
early recognition memory. Years later, the investi- 
gators administered the Peabody Picture Vocabulary 
Test (PPVT) to gauge early childhood intelligence. 
Infant recognition memory scores and early child- 
hood PPVT scores correlated .37 at four years of 
age and .57 at seven years of age. These correlations 
probably underestimate the predictive validity of infant
memory tests insofar as the index of infant
memory was an unreliable procedure based upon a 
small number of test items. Furthermore, the re- 
searchers assessed normal infants, which attenuated 
the correlations between predictor and criterion. 
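The claim that an unreliable predictor lowers the observed correlation can be illustrated with the classical correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below is an illustration only: the reliability values are hypothetical, since the source does not report them, and the function name is an assumption.

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimate the correlation between true scores.

    r_xy: observed correlation between predictor and criterion
    r_xx: reliability of the predictor
    r_yy: reliability of the criterion
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

observed = 0.57            # infant memory with PPVT at age 7 (from the text)
infant_reliability = 0.50  # hypothetical value for a brief infant measure
ppvt_reliability = 0.90    # hypothetical value for the criterion test

estimate = correct_for_attenuation(observed, infant_reliability,
                                   ppvt_reliability)
print(f"disattenuated correlation = {estimate:.2f}")
```

The lower the reliability of the infant measure, the larger the gap between the observed and the corrected value. The correction does not address restriction of range, the second source of underestimation mentioned above.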
Infant cognitive measures possess a great deal
of promise as predictors of childhood intelligence
(Bornstein, 1994; Fagan & Haiken-Vasen, 1997).
In the years ahead, we may witness the emergence 
of entirely new types of infant assessment devices 
based on the measurement of early memory, habit- 
uation, and attentional capacities instead of senso- 
rimotor abilities. A first step in this direction is 
Fagan’s Test of Infant Intelligence (FTII, Fagan & 
Shepherd, 1986), a simple instrument based upon 
the methods previously outlined for measuring in- 
fant novelty preference and recognition memory. 
The FTII yields a composite score that is based
upon preference for novelty—as measured by vi- 
sual fixation time on a new picture—averaged over 
several trials. The procedure shows very high interrater
agreement (O’Neill, Jacobson, & Jacobson,
1994).
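The logic of a novelty-preference composite can be sketched in a few lines. The example below assumes that each trial is scored as the proportion of total looking time spent on the novel picture and that the composite is the simple mean across trials; the scoring formula, function names, and fixation times are all hypothetical and are not taken from the FTII manual.

```python
from statistics import mean

def novelty_preference(novel_seconds: float, familiar_seconds: float) -> float:
    """Proportion of total fixation time spent on the novel picture."""
    total = novel_seconds + familiar_seconds
    if total <= 0:
        raise ValueError("infant must fixate at least one picture")
    return novel_seconds / total

def composite_score(trials: list[tuple[float, float]]) -> float:
    """Average novelty preference over (novel, familiar) fixation times."""
    return mean(novelty_preference(novel, familiar)
                for novel, familiar in trials)

# Hypothetical fixation times in seconds: (novel picture, familiar picture)
trials = [(6.0, 4.0), (7.0, 3.0), (5.0, 5.0)]
print(f"composite novelty preference = {composite_score(trials):.2f}")
```

On this scoring, a value of .50 indicates no preference, and values above .50 indicate that the infant looked longer at the new picture, which the text treats as evidence of recognition memory.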

Initial validity studies of the FTII as a predictor 
of childhood intelligence are mixed in outcome. In 
one sample of 200 infants, the FTII scores obtained 
at 7 to 9 months correlated only .32 with Stanford- 
Binet IQ at age 3 (DiLalla, Thompson, Plomin, and 
others, 1990). In another recent study, overall cor- 
relations between FTII scores obtained at 7 to 9 
months and WPPSI-R IQ at age 5 were around .2 
for two Norwegian samples of healthy children 
(Andersson, 1996). These correlations do not sup- 
port the use of the test as a screening tool in non- 
risk populations. However, the test may function 
better when used with at-risk infants. Nonetheless, 
further research is needed before we abandon tra- 
ditional infant measures in favor of the Fagan test 
and similar measures. 


SUMMARY 


1. The infant and preschool period extends 
from birth to about age 6. An important application 
of infant and preschool tests is to help answer ques- 
tions about developmental delay. Most infant tests 
(ages birth to 2½) load heavily on sensory and
motor skills. Preschool tests (ages 2½ to 6) tend to
tap cognitive skills to a significant degree. 


2. The Gesell Developmental Schedules
(GDS) gauge the developmental progress of babies
from 4 weeks to 60 months of age.


3. The Neonatal Behavioral Assessment Scale 
(NBAS) assesses the newborn infant’s behavioral 
repertoire on 28 behavior items (scored on a 9-point 
scale), 18 reflexes (scored on a 4-point scale), and 
7 qualities of responsiveness. The instrument is sen- 
sitive to prenatal cocaine exposure and other neu- 
rotoxins. The NBAS is also used to sensitize parents 
to the uniqueness of their infants. 


4. The Ordinal Scales of Psychological Development
were designed as a Piagetian-based
measure of intellectual development (ages 2 weeks
to 2 years). The scales measure development of ob- 
ject permanence, means-ends, vocal and gestural 
imitation, operational causality, object relations in 
space, and schemes for relating to objects. 


5. The Bayley Scales of Infant Development-II
assess mental and motor development of children
from 1 month to 42 months of age. The Bayley is 
very carefully standardized and highly reliable. 
Like other infant tests, very low scores predict an 
intellectually disabled outcome in later childhood, 
while near-normal and higher scores possess little 
predictive validity. 

6. The Wechsler Preschool and Primary Scale 
of Intelligence-Revised (WPPSI-R) is designed for 
children ages 3 years to 7 years and 3 months. The 
WPPSI-R contains three subtests not found on 
other Wechsler Scales: Sentences (oral memory); 
Geometric Designs (design copying); and Animal 
Pegs (coded placement of pegs). 

7. The Stanford-Binet: Fourth Edition
(SB:FE) is a useful instrument for preschool assessment.
Although the test is designed to yield four factor
scores, Sattler’s (1988) two-factor solution to the
reporting of SB:FE scores (Verbal Comprehension
and Nonverbal Reasoning/Visualization) is the preferred
approach with preschoolers.

8. The Kaufman Assessment Battery for
Children (K-ABC), used for children ages 2:6
through 12:5 years, is a combined measure of intelligence
and achievement based upon the distinction
between sequential processing (serial or temporal
arrangement of stimuli) and simultaneous
processing (synthesis and organization of stimuli in
an immediate or wholistic fashion).


9. The McCarthy Scales of Children’s Abili- 
ties are designed for children ages 2:6 to 8:6 years. 
The 18 subtests produce five different subscores 
and a General Cognitive Index (GCI) akin to an IQ. 
The subscores include verbal, perceptual-perfor- 
mance, quantitative, memory, and motor. Unfortu- 
nately, these five areas are not confirmed by 
independent factor analyses. 


10. Designed for children ages 2 years 6 
months through 17 years 11 months, the Differen- 
tial Ability Scales consists of 17 cognitive subtests 
and 3 conormed achievement tests for school- 
aged children. Initial research indicates that the 
DAS yields reliable and reasonably independent 
subtest scores useful in the assessment of learning 
disability. 

11. In general, infant test scores correlate pos- 
itively but weakly with childhood test scores. In- 
fant test scores must be interpreted with caution. An 
exception is very low infant test scores on such de- 
vices as the Bayley-II, which reliably predict 
developmental disability in childhood. 


12. Tests of recognition memory in infants 
show promise as predictors of childhood intelli- 
gence. For example, in Fagan’s studies, indices of 
simple visual habituation in infancy correlated .57 
with picture vocabulary scores at age 7.


Summary


Key Terms and Concepts 


Intelligence testing is one of the major achievements
of psychology in the twentieth century. In
response to the success of the Binet-Simon scales 
in the early 1900s, psychologists developed and 
refined dozens of individual tests of intelligence 
patterned after this pathbreaking instrument. Ex- 
plosive growth was also observed in group tests of 
intelligence, fostered by the enthusiastic accep- 
tance of the Army Alpha and Beta tests during and 
after World War I. With only a few exceptions, con- 
temporary individual and group tests of intelligence 
owe their lineage to Binet, Simon, and the Army 
testing program.

The purpose of this chapter is to provide an
overview of noteworthy approaches to the testing 
of individual and group intelligence. We survey 
prominent individual tests in Topic 6A and then 
close the chapter with a review of group intelli- 
gence tests in Topic 6B. Even though this text de- 
votes three full chapters to the fascinating and 
emotionally charged topic of intelligence testing, 
we make no pretext that the coverage is compre- 
hensive. An exhaustive analysis of intelligence test- 
ing is simply beyond the scope of this or any other 
basic reference. New and revised tests appear practically
every month, and thousands of new research
findings are published every year. We have chosen
to review tests that are widely used or that illustrate 
interesting developments in theory or method. 
Readers can find information on additional tests in 
the Mental Measurements Yearbook series, now 
published every three or four years by the Buros 
Institute (e.g., Conoley & Kramer, 1989, 1992; 
Impara & Plake, 1998; Plake & Impara, 2001). The 
Encyclopedia of Human Intelligence (Sternberg, 
1994) is also a good source of information on indi- 
vidual and group tests of intelligence. 


ORIENTATION TO INDIVIDUAL
INTELLIGENCE TESTS


The individual intelligence tests reviewed in this
topic include the following:


• Wechsler Adult Intelligence Scale-III (WAIS-III)
• Wechsler Intelligence Scale for Children-III (WISC-III)
• Stanford-Binet: Fifth Edition (SB5)
• Detroit Test of Learning Aptitude-4 (DTLA-4)
• Kaufman Brief Intelligence Test (K-BIT)


Another promising test that we do not review in 
depth is the Kaufman Adolescent and Adult Intelli- 
gence Test (KAIT). Published in 1992, the KAIT is 
a recent arrival on the testing scene (Dumont & 
Hagberg, 1994; Shaughnessy & Moore, 1994). 
Kaufman and Kaufman (1997) list several advantages
of the KAIT, including its psychometric foundation
in the gf-gc distinction proposed by John
Horn and his followers. The KAIT also is appealing
because of its brevity: The test provides highly
reliable indices of intelligence in two-thirds the time 
needed for most batteries. Along with the preschool 
tests presented in the previous topic, the previously 
listed instruments probably account for 98 percent 
of the intellectual assessments conducted in the 
United States. 

The Wechsler scales have dominated intelli- 
gence testing in recent years, but they are by no 
means the only viable choices for individual as- 
sessment. Many other instruments measure general 
intelligence just as well—some would say better. 
Consider the implications of a now familiar observation:
For large, heterogeneous samples, scores on
any two mainstream instruments (e.g., Wechsler, 
Stanford-Binet, McCarthy, Kaufman scales) typi- 
cally correlate 0.80 to 0.90. Often, the correlation 
between two mainstream instruments is nearly as 
high as the test-retest correlation for either instru- 
ment alone. For purposes of producing a global 
score, it would appear that any well-normed main- 
stream intelligence test will suffice. 

But producing an overall score is not the only 
goal of assessment. In addition, the examiner usu- 
ally desires to gain an understanding of the sub- 
ject’s intellectual functioning. For this purpose, the 
overall IQ is important, but there are instances in 
which the global score may be irrelevant or even 
misleading. To understand a referral’s intellectual 
functioning, the examiner should also inspect the 
subtest scores in search of hypotheses that might 
explain the unique functioning of that individual. 
Of course, examiners need to undertake subtest 
analysis cautiously, armed with research-based 
findings on the nature and meaning of subtest scat- 
ter for the test in use (Gregory, 1994b; McLean, 
Kaufman, & Reynolds, 1989; McDermott, Fan- 
tuzzo, & Glutting, 1990). 

If the examiner’s goal is to understand intellec- 
tual functioning and not merely to determine an 
overall score, the differences between tests become 
quite real. Every instrument approaches the mea- 
surement of intelligence from a different perspective 
and yields a distinctive set of subtest scores. Fur- 
thermore, a test well suited for one referral issue 
might perform abysmally in another context. For ex- 
ample, the WAIS-III performs admirably in the test- 
ing of mild mental retardation, but contains too few 
simple items for the effective assessment of persons 
with moderate or severe developmental disability. 

A central axiom of assessment is that the 
choice of a testing instrument should be based on 
knowledge of its strengths and weaknesses as they 
pertain to the referral question. Put simply, the 
skilled examiner does not blindly rely upon a sin- 
gle test for every referral! Instead, the skilled ex- 
aminer flexibly chooses one or more instruments in 
light of the perceived assessment needs of the examinee.
Each of the tests discussed in this topic has
its special merits and also its particular shortcomings.
The test user must know these strong and
weak facets in order to choose the instruments best 
suited for each unique referral. 


THE WECHSLER SCALES 
OF INTELLIGENCE 


Beginning in the 1930s, David Wechsler, a psy- 
chologist at Bellevue Hospital in New York City, 
conceived a series of elegantly simple instruments 
that virtually defined intelligence testing in the mid- 
to late twentieth century. His influence on intelli- 
gence testing is exceeded only by the pathbreaking 
contributions of Binet and Simon. It is fitting that 
we begin the survey of individual tests with a his- 
torical summary of the Wechsler tradition, followed 
by a discussion of individual instruments. 


Origins of the Wechsler Tests 


Wechsler began work on his first test in 1932, seek- 
ing to devise an instrument suitable for testing the 
diverse patients referred to the psychiatric section 
of Bellevue Hospital in New York (Wechsler, 
1932). In describing the development of his first 
test, he later wrote, “Our aim was not to produce a 
set of brand new tests but to select, from whatever 
source available, such a combination of them as 
would meet the requirements of an effective adult 
scale” (Wechsler, 1939). In fact, the content of his 
scales was largely inspired by earlier efforts such as 
the Binet scales and the Army Alpha and Beta tests 
(Frank, 1983). Readers who peruse Psychological 
Examining in the United States Army, a volume 
edited by Yerkes (1921) just after World War I, 
might be astonished to discover that Wechsler pur- 
loined dozens of test items from this source, many 
of which have survived to the present day in con- 
temporary revisions of the Wechsler tests. Wechsler 
was not so much a creative talent as a pragmatist 
who fashioned a new and useful instrument from 
the spare parts of earlier, discontinued attempts at 
intelligence testing. 

The first of the Wechsler tests, named the 
Wechsler-Bellevue Intelligence Scales, was published
in 1939. In discussing the rationale for his
new test, Wechsler (1941) explained that existing 
instruments such as the Stanford-Binet were woe- 
fully inadequate for assessing adult intelligence. 
The Wechsler-Bellevue was designed to rectify sev- 
eral flaws noted in previous tests: 


• The test items possessed no appeal for adults.

• Too many questions emphasized mere manipulation
of words.

• The instructions emphasized speed at the expense
of accuracy.

• The reliance upon mental age was irrelevant to
adult testing.


To correct these shortcomings, Wechsler de- 
signed his test specifically for adults, added perfor- 
mance items to balance verbal questions, reduced 
the emphasis upon speeded questions, and invented 
a new method for obtaining the IQ. Specifically, he 
replaced the usual formula 


$$\mathrm{IQ} = \frac{\text{Mental Age}}{\text{Chronological Age}}$$


with a new age-relative formula


$$\mathrm{IQ} = \frac{\text{Attained or Actual Score}}{\text{Expected Mean Score for Age}}$$


This new formula was based on the interesting pre- 
sumption—stated in the form of an axiom—that IQ 
remains constant with normal aging, even though 
raw intellectual ability might shift or even decline. 
The assumption of IQ constancy is basic to the 
Wechsler scales. As Wechsler (1941) put it: 


The constancy of the I.Q. is the basic assumption 
of all scales where relative degrees of intelligence 
are defined in terms of it. It is not only basic, but 
absolutely necessary that I.Q.’s be independent of 
the age at which they are calculated, because un- 
less the assumption holds, no permanent scheme 
of intelligence classification is possible. 


Although Wechsler’s view has been largely ac- 
cepted by contemporary test developers, it is 
important to stress that the assumption of IQ invariance
with age is really a statement of values, a
philosophical choice, and not necessarily an inherent
characteristic of human nature.
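
The contrast between the two formulas can be made concrete with a brief sketch. The function names and the sample scores are hypothetical, and the multiplication by 100 is the conventional scaling for a quotient rather than part of the formulas as printed above.

```python
def ratio_iq(mental_age, chronological_age, scale=100):
    """Traditional ratio IQ: mental age divided by chronological age."""
    return scale * mental_age / chronological_age


def age_relative_iq(attained_score, expected_mean_for_age, scale=100):
    """Age-relative quotient: attained score over the mean expected at that age."""
    return scale * attained_score / expected_mean_for_age


# Hypothetical raw scores for one person tested at two ages. The raw score
# declines, but the expected mean for the age group declines in proportion.
at_25 = age_relative_iq(attained_score=60, expected_mean_for_age=50)
at_70 = age_relative_iq(attained_score=48, expected_mean_for_age=40)
print(at_25, at_70)

# The ratio formula works for a child (mental age 10, chronological age 8) ...
print(ratio_iq(10, 8))
```

In this made-up case the age-relative quotient is the same at both ages even though the raw score has dropped, which is the sense in which IQ is held constant across normal aging.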

Wechsler also hoped to use his test as an aid in 
psychiatric diagnosis. In pursuit of this goal, he di- 
vided his scale into separate verbal and performance 
sections. This division allowed the examiner to com- 
pare an examinee’s facility in using words and sym- 
bols (verbal subtests) versus the ability to manipulate 
objects and perceive visual patterns (performance 
subtests). Large differences between verbal ability 
(V) and performance ability (P) were thought to be 
of diagnostic significance. Specifically, Wechsler be- 
lieved that organic brain disease, psychoses, and 
emotional disorders gave rise to a marked V > P pat- 
tern, whereas adolescent psychopaths and persons 
with mild mental retardation yielded a strong P > V 
pattern. Subsequent research demonstrated many ex- 
ceptions to these simple diagnostic rules. Nonethe- 
less, the distinction between verbal and performance 
skills has proved valid and useful for other purposes, 
such as analyzing brain-behavior relationships and 
studying age effects on intelligence. Wechsler’s arm- 
chair division of subtests into verbal and perfor- 
mance sections ranks as perhaps his most enduring 
contribution to contemporary intelligence testing 
(Kaufman, Lichtenberger, & McLean, 2001). 


General Features of the Wechsler Tests 


Including revisions, David Wechsler and his fol- 
lowers produced 10 intelligence tests in a span of 
about 60 years. A major reason for the success of 
these instruments was that each new test or revision 
remained faithful to the familiar content and format 
first introduced in the Wechsler-Bellevue. By stick- 
ing with a single successful formula, Wechsler en- 
sured that examiners could switch from one 
Wechsler test to another with minimal retraining. 
This was not only good psychometrics but also 
shrewd marketing insofar as it guaranteed several 
generations of faithful test users. 

The various versions and editions of the Wech- 
sler tests possess the following common features: 


• Ten to fourteen subtests. The multi-subtest approach
allows the examiner to analyze intraindividual
strengths and weaknesses rather than just
to compute a single global score. As the reader
will learn subsequently, the pattern of subtest
scores may convey useful information not evident
from the overall level of performance.

• A Verbal Scale composed of five or six subtests
and a Performance Scale composed also of five
or six subtests. With this division, the examiner
can assess verbal comprehension and perceptual
organization skills separately. The pattern of
abilities on these two factors of intelligence may
have a bearing on the functional integrity of the
left and right hemispheres of the brain, as well as
indicating vocational strengths and weaknesses,
as discussed in the following.

• A common metric for IQ and Index scores. The
mean for IQ and Index scores is 100 and the standard
deviation is 15 for all tests and all age
groups. In addition, the scaled scores on each
subtest have a mean of 10 and a standard deviation
of approximately 3, which permits the examiner
to analyze the subtest scores of the
examinee for relative strengths and weaknesses.

• Common subtests for different ages. For example,
the preschool, child, and adult Wechsler tests
(WPPSI-R, WISC-III, and WAIS-III) all contain
a common core of the same eight subtests (Table
6.1). An examiner who masters the administration
of one core subtest on any of the Wechsler
tests (such as the Information subtest on the
WAIS-III) easily can transfer this skill within the
Wechsler family of intellectual measures.
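
The common metric described in the list above can be illustrated with a short sketch. The helper names and the four sample scaled scores are hypothetical; only the means and standard deviations (100 and 15 for IQ and Index scores, 10 and 3 for subtest scaled scores) come from the text.

```python
IQ_MEAN, IQ_SD = 100, 15
SUBTEST_MEAN, SUBTEST_SD = 10, 3


def to_z(score, mean, sd):
    """Express a score as a distance from the mean in standard deviation units."""
    return (score - mean) / sd


def from_z(z, mean, sd):
    """Place a z score on a scale with the given mean and standard deviation."""
    return mean + sd * z


# A scaled score of 13 is one standard deviation above the subtest mean,
# which is the same relative standing as 115 on the IQ metric.
z = to_z(13, SUBTEST_MEAN, SUBTEST_SD)
iq_equivalent = from_z(z, IQ_MEAN, IQ_SD)

# Relative strengths and weaknesses: each subtest compared with the
# examinee's own average (hypothetical scaled scores).
scaled = {"Information": 12, "Digit Span": 7, "Vocabulary": 13, "Block Design": 8}
own_mean = sum(scaled.values()) / len(scaled)
deviations = {name: score - own_mean for name, score in scaled.items()}
print(iq_equivalent, own_mean, deviations)
```

Because every subtest shares one metric, a deviation of three points means roughly one standard deviation regardless of which subtest produced it.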


THE WECHSLER SUBTESTS: 
DESCRIPTION AND ANALYSIS 


Wechsler (1939) defined intelligence as “the ag- 
gregate or global capacity of the individual to act 
purposefully, to think rationally and to deal effec- 
tively with his environment.” He also believed that 
we can only know intelligence by what it enables a 
person to do. In designing his tests, then, Wechsler 
selected components to represent a wide array of 
underlying abilities so as to estimate the global 
capacity of intelligence. Furthermore, he asked
his subjects to do things, not merely to answer
questions. The Wechsler subtests are quite diverse
and often rely upon what Wechsler referred to as
“mental productions.”


TABLE 6.1 Subtest Composition of the
Wechsler Scales

                              WPPSI-R   WISC-III   WAIS-III

Verbal Scales
  Information*                   x         x          x
  Digit Span                               x          x
  Vocabulary*                    x         x          x
  Arithmetic*                    x         x          x
  Comprehension*                 x         x          x
  Similarities*                  x         x          x
  Sentences                      x
  Letter-Number Sequencing                            x

Performance Scales
  Picture Completion*            x         x          x
  Picture Arrangement                      x          x
  Block Design*                  x         x          x
  Matrix Reasoning                                    x
  Object Assembly*               x         x          x
  Coding/Digit Symbol                      x          x
  Mazes                          x         x
  Geometric Design               x
  Symbol Search                            x          x
  Animal Pegs                    x

Note: The “core” subtests common to all Wechsler scales are
marked with an asterisk.

We present here a description of subtests from 
the WISC-III and WAIS-III. We also analyze the 
abilities tapped by each subtest and offer research- 
based comments. The reader is referred to Topic 5B 
for a description of three subtests unique to the 
WPPSI-R (Sentences, Geometric Designs, and An- 
imal Pegs). The verbal subtests are listed first. 


Information 


Factual knowledge of persons, places, and common 
phenomena is tested here. Questions for children 
are like the following:

“How many eyes do you have?”
“Who invented the telephone?” 
“What causes a solar eclipse?” 
“Which is the largest planet?” 


Questions for adults are similar, but progress to 
higher levels of difficulty. Difficult questions on the 
adult Information subtest resemble: 


“Which is the most common element in air?” 
“What is the population of the world?” 

“How does fruit juice get converted to wine?” 
“Who wrote Madame Bovary?” 


Information items test general knowledge 
normally available to most persons raised in the 
cultural institutions and educational systems of 
Western industrialized nations. Indirectly, this sub- 
test measures learning and memory skills insofar as 
subjects must retain knowledge gained from formal 
and informal educational opportunities in order to 
answer the Information items. 

Information is usually regarded as one of the 
best measures of general ability among the Wech- 
sler subtests (Kaufman, McLean, & Reynolds, 
1988). For example, the WAIS-III manual reveals 
that Information typically has the second or third 
highest correlation with Full Scale IQ across the 13 
age groups (Tulsky, Zhu, & Ledbetter, 1997). In- 
formation consistently loads strongly on the first 
factor identified in factor analyses of the WAIS-III 
subtest correlations (see the following). The first 
factor is labeled Verbal Comprehension. However, 
Information tends to reflect formal education and 
motivation for academic achievement and may 
therefore yield spuriously high ability estimates for 
perpetual students and avid readers. 


Digit Span 


Digit Span consists of two separate sections, Digits 
Forward and Digits Backward. In Digits Forward, 
the examiner reads a series of digits at one per sec- 
ond, then asks the subject to repeat them. If the sub- 
ject answers correctly on two consecutive trials of 
the same length, the examiner proceeds to the next 
series, which is one digit longer, up to a maximum
length of nine digits. For Digits Backward, a similar
procedure is used, except the examinee must repeat 
the digits in reverse order, up to a maximum length 
of eight digits. For example, the examiner reads: 


“6-1-3-4-2-8-5” 


and the subject tries to repeat the numbers in the re- 
verse order: 


“5-8-2-4-3-1-6.” 
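
The response rule for the two sections can be sketched in a few lines. The function names are hypothetical, and the sketch checks only a single trial; it does not model the discontinue rules of the actual subtest.

```python
def expected_response(presented, backward=False):
    """Return the digit sequence the examinee should say for one trial."""
    digits = list(presented)
    return digits[::-1] if backward else digits


def is_correct(presented, response, backward=False):
    """A trial is passed only if every digit is given in the required order."""
    return list(response) == expected_response(presented, backward)


presented = [6, 1, 3, 4, 2, 8, 5]
print(expected_response(presented, backward=True))
print(is_correct(presented, [5, 8, 2, 4, 3, 1, 6], backward=True))
```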


Digit Span is a measure of immediate auditory 
recall for numbers. Facility with numbers, good at- 
tention, and freedom from distractibility are re- 
quired. Performance on this subtest may be affected 
by anxiety or fatigue, and many clinicians have 
noted that patients hospitalized for medical or psy- 
chiatric reasons frequently perform poorly on Digit 
Span. 

Digits Forward and Digits Backward may as- 
sess fundamentally different abilities. Digits For- 
ward seems to require the examinee to access an 
auditory code in sequential fashion. In contrast, to 
perform Digits Backward, the examinee must form 
an internal visual memory trace from the orally pre- 
sented numerical sequences and then visually scan 
from end to beginning. Digits Backward is clearly 
the more complex test; not surprisingly, it loads 
higher on general intelligence than does Digits For- 
ward (Jensen & Osborne, 1979). Gardner (1981) 
argues that examiners should supplement standard 
reporting procedures and list separate subscores for 
Digit Span. He presents separate means, standard 
deviations, and percentile ranks on Digits Forward 
and Backward for children ages 5 to 15. 


Vocabulary 


The subject is asked to define up to several dozen 
words of increasing difficulty while the examiner 
writes down each response verbatim. For example, 
on an easy item the examiner might ask, “What is 
a cup?” and the examinee would get partial credit 
for answering, “You drink with it” and full credit 
for answering, “It has a handle, holds liquids, and 
you drink from it.” For adults and bright children, 
the advanced items on the Wechsler Vocabulary
subtests can be very challenging, on a par with tincture,
obstreperous, and egregious.

Vocabulary is learned largely in context from 
reading books and listening to others. It is a rare in- 
dividual who picks up vocabulary by reading the 
dictionary or memorizing word lists from the 
“Building Your Wordpower” section of popular 
magazines. In the main, a person’s vocabulary is a 
measure of sensitivity to new information and the 
ability to decipher meanings based on the context 
in which words are encountered. Precisely because 
the acquisition of word meaning depends upon con- 
textual inference, the Vocabulary subtest turns out 
to be the single best measure of overall intelligence 
on the Wechsler scales (Gregory, 1999). This is a 
surprise to many laypersons who regard vocabulary 
as merely synonymous with educational exposure 
and therefore a mediocre index of general intelli- 
gence. However, there is simply no denying the em- 
pirical evidence: Vocabulary has the highest subtest 
correlation with Full Scale IQ on both the WISC- 
III (combined age groups) and also the WAIS-III 
(for 12 of the 13 age groups). 


Arithmetic 


Except for the very easiest items for young people 
or persons who have mental retardation, the Arith- 
metic subtest consists of orally presented mathe- 
matics problems. The examinee must solve the 
problems without paper or pencil within a time 
limit (usually 30 to 60 seconds). The simple items 
stress fundamental operations of addition or sub- 
traction, for example: 


“If you have fifteen apples and give seven away, 
how many are left?” 


The more difficult items require proper conceptu- 
alization of the problem and the application of two 
arithmetic operations, for example: 


“John bought a stereo that was marked down 15 
percent from the original sales price of $600. 
How much did John pay for the stereo?” 
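
The two sample items above can be worked through step by step. The variable names are ours; the sketch simply spells out the single operation needed for the easy item and the two operations needed for the harder one.

```python
# Easy item: a single subtraction.
apples_left = 15 - 7

# Harder item: two operations (find the discount, then subtract it).
original_price = 600      # dollars
discount_percent = 15
discount = original_price * discount_percent // 100
sale_price = original_price - discount

print(apples_left, discount, sale_price)
```

The harder item is demanding less because of the arithmetic itself than because the intermediate result (the discount) must be held in mind while the second operation is carried out.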


Although the mathematical requirements of the 
Arithmetic items are not excessively demanding,
the necessity of solving the problems mentally
within a time limit makes this subtest quite chal- 
lenging for most examinees. In addition to rudi- 
mentary arithmetic skills, successful performance 
on Arithmetic requires high levels of concentration 
and the ability to maintain intermediate calcula- 
tions in short-term memory. In factor analyses of 
the WISC-III and WAIS-III, Arithmetic often loads 
on a third factor variously interpreted as Freedom 
from Distractibility or Working Memory. 


Comprehension 


The Comprehension subtest is an eclectic collection 
of items that require explanation rather than mere
factual knowledge. The easy questions stress com- 
mon sense, whereas the more difficult questions 
require an understanding of social and cultural 
conventions. On the WAIS-III, the two most diffi- 
cult questions require the examinee to interpret 
proverbs. 

An easy item on Comprehension is of the form 
“Why do people wear clothes?” Difficult items re- 
semble the following: 


“What does this saying mean: ‘A bird in the 
hand is worth two in the bush.’ ” 

“Why are Supreme Court Judges appointed for 
life?” 


Comprehension would appear to be, in part, 
a measure of “social intelligence” in that many 
items tap the examinee’s understanding of social 
and cultural conventions. Sipps, Berry, and Lynch 
(1987) found that Comprehension scores were 
moderately related to measures of social intelli- 
gence on the California Psychological Inventory. Of 
course, a high score signifies only that the examinee 
is knowledgeable about social and cultural conven- 
tions; choosing right action may or may not flow 
from this knowledge. However, recent studies 
by Campbell and McCord (1996) and Lipsitz, 
Dworkin, and Erlenmeyer-Kimling (1993) provide 
no support for the commonly accepted clinical lore 
that Comprehension scores are sensitive to social 
functioning.


Similarities


In this subtest, the examinee is asked questions of 
the type, “In what way are shirts and socks alike?” 
The Similarities subtest evaluates the examinee’s 
ability to distinguish important from unimportant 
resemblances in objects, facts, and ideas. Indirectly, 
these questions assess the assimilation of the con- 
cept of likeness. The examinee must also possess 
the ability to judge when a likeness is important 
rather than trivial. For example, “shirts” and 
“socks” are alike in that both begin with the letter 
s, but this is not the essential similarity between 
these two items. The important similarity is that 
shirts and socks are both exemplars of a concept, 
namely, “clothes.” As this example illustrates, Sim- 
ilarities can be thought of as a test of verbal con- 
cept formation. 

We turn now to a description and analysis of 
Wechsler performance subtests. With the exception 
of Matrix Reasoning on the WAIS-III, all of the performance
subtests are timed, and for most the examinee
earns bonus points for quick performance.


Letter-Number Sequencing 


This is a new subtest found only on the WAIS-III.
The examiner orally presents a series of letters and
numbers that are in random order. The examinee 
must reorder and repeat the list by saying the num- 
bers in ascending order and then the letters in al- 
phabetical order. For example, if the examiner says 
“R-3-B-5-Z-1-C,” the examinee should respond 
“1-3-5-B-C-R-Z.” This test measures attention, 
concentration, and freedom from distractibility. To- 
gether with Arithmetic and Digit Span, this subtest 
contributes to the Working Memory Index score on 
the WAIS-III (see the following). Donders, Tulsky,
and Zhu (2001) found the Letter-Number Se- 
quencing subtest to be highly sensitive to the effects 
of moderate and severe traumatic brain injury. 
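
The reordering rule for this subtest (numbers in ascending order, then letters in alphabetical order) can be expressed as a short sketch. The function name is hypothetical, and single digits and single letters are assumed, as in the example above.

```python
def expected_sequence(items):
    """Return the correct response: digits ascending, then letters A to Z."""
    numbers = sorted(item for item in items if item.isdigit())
    letters = sorted(item.upper() for item in items if item.isalpha())
    return numbers + letters


presented = "R-3-B-5-Z-1-C".split("-")
print("-".join(expected_sequence(presented)))
```

The examinee must hold the whole list in mind while sorting two interleaved categories, which is why the task is treated as a measure of working memory.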


Picture Completion 


For this subtest, the examiner asks the subject to 
identify the “important part” that is missing from a
picture. For example, a simple item might be of
this type: a picture of a table with one leg missing.
The items get harder and harder; testing continues
until the examinee misses several in a row. Figure
6.1 depicts an item similar to those found on the
WAIS-III.


FIGURE 6.1 Picture Completion Item Similar to Those
Found on the WAIS-III


Although Picture Completion is included on the
performance half of each Wechsler test, the abilities 
required for this subtest overlap only modestly with 
the classic measures of performance intelligence 
(e.g., Block Design). For one thing, successful per- 
formance on Picture Completion largely involves 
access to long-term memory rather than perceptual- 
manipulative skill. True, the examinee must have 
good attention to visual detail. But high scores 
mainly reflect the ability to compare each drawing 
with similar items or situations stored in long-term 
memory. In sum, Picture Completion really doesn’t 
require a performance component. The examinee 
needs to verbalize the missing element or merely 
point to the section of the drawing that is anomalous. 
The Picture Completion subtest presupposes that 
the examinee has been exposed to the object or 
situation represented. For this reason, Picture Com- 
pletion may be inappropriate for culturally disad- 
vantaged persons. 


Picture Arrangement 


In this subtest, several panels of nonverbal cartoon 
strips are laid down out of order by the examiner. 
The examinee’s task is to put the panels together in 
the correct order to tell a sensible story. Figure 6.2 
depicts a picture arrangement task, such as might 
be found on the WAIS-III.


FIGURE 6.2 Picture Arrangement Item Similar to Those Found on the WAIS-III


Although Picture Arrangement is grouped with
the Performance tasks, it loads about equally on the 
verbal and performance components revealed in 
factor-analytic studies of subtest intercorrelations 
(e.g., Silverstein, 1982a). The abilities tapped by 
Picture Arrangement are complex and multifac- 
eted. Before sorting the pictures, the examinee 
must be able to decipher the gestalt of the entire 
story from its disarranged elements. This subtest 
also measures sequential thinking and the ability to 
see relationships between social events. On the 
WAIS-III, several of the Picture Arrangement stories
have humorous themes. As a result, social sophistication
and a sense of humor are required for
successful performance. 


Block Design 


On the Block Design subtest, the examinee must re- 
produce two-dimensional geometric designs by 
proper rotation and placement of three-dimensional 
colored blocks. This subtest was depicted in Topic 
2B, The Testing Process. For all of the Wechsler 
scales, the first few Block Design items can be 
solved through trial and error. However, the more 
difficult items require the analysis of spatial rela- 
tions, visual-motor coordination, and the rigid ap- 
plication of logic. Block Design demands much 
more problem-solving and reasoning ability than 
most of the Performance subtests in which memory 
and prior experience are more heavily weighted. In 
factor analyses of the Wechsler scales, Block Design 
typically has the highest loading of all the Perfor- 
mance subtests on the second factor. This factor is 
variously identified as nonverbal, visuospatial, 
or perceptual-organizational intelligence (Fowler, 
Zillmer, & Macciocchio, 1990; Silverstein, 1982a). 
On the WISC-III and WAIS-III, Block Design has
the highest correlation with Performance IQ for all 
but a few of the standardization groups between ages 
6 and 89. For this reason, Block Design is generally 
recognized as the quintessential index of nonverbal 
intelligence on the Wechsler tests (Gregory, 1999). 

Block Design is a strongly speeded test. Con- 
sider the WAIS-R version, which consists of 14 de- 
signs of increasing difficulty. To obtain a high score
on this subtest, adults must not only reproduce each
of the designs correctly, they must also earn bonus 
points on the last eight designs by completing them 
quickly. An examinee who solves all the designs 
within the time limit but who fails to garner any 
bonus points will test out at just slightly above 
average on this subtest. Block Design scores may 
be misleading for examinees who do not value 
speeded performance. 


Matrix Reasoning 


Matrix Reasoning is a new subtest found only on the 
WAIS-III. It was added to enhance the assessment 
of nonverbal reasoning on the adult test. The subtest 
consists of 26 figural reasoning problems arranged 
in increasing order of difficulty (Figure 6.3). Find- 
ing the correct answer requires the examinee to 
identify a recurring pattern or relationship between 
figural stimuli drawn along a straight line (simple 
items) or in a 3 x 3 grid (hard items) in which the 
last item is missing. Based upon nonverbal reason- 
ing about the patterns and relationships, the exami- 
nee must infer the missing stimulus and select it 
from five choices provided at the bottom of the card. 

Matrix Reasoning was designed to be a measure 
of fluid intelligence, which is the capacity to per- 
form mental operations such as manipulation of ab- 
stract symbols. The items tap pattern completion, 
reasoning by analogy, and serial reasoning. Overall, 
the subtest is an excellent measure of inductive rea- 
soning based on figural stimuli. Matrix Reasoning is 
the only untimed performance subtest on the
WAIS-III. Interestingly, Donders et al. (2001) report that
the Matrix Reasoning subtest is relatively unaffected 
by moderate and severe traumatic brain injury. 


Object Assembly 


For each item, the examinee must assemble the 
pieces of a jigsaw puzzle to form a common object 
(Figure 6.4). For example, Object Assembly on 
the WAIS-III consists of five puzzles: a manikin (6 
pieces), a profile (7 pieces), an elephant (6 pieces), 
a house (9 pieces), and a butterfly (7 pieces). The 
examiner does not identify the items, so the examinee
must first discern the identity of each item
from its disarranged parts. Success on this subtest 
requires high levels of perceptual organization; that 
is, the examinee must grasp a larger pattern or 
gestalt based on perception of the relationships 
among the individual parts. 








FIGURE 6.4 Object Assembly Item Similar to Those
Found on the WAIS-III


FIGURE 6.3 Matrix Reasoning Item Similar to
Those Found on the WAIS-III


Object Assembly is one of the least reliable of 
the Wechsler subtests. For example, on the WAIS- 
III this subtest has an average split-half reliability of 
only .70 (Tulsky, Zhu, & Ledbetter, 1997). Among 
the WAIS-III subtests, only Picture Arrangement 
with a value of .74 approaches the unreliability of 
Object Assembly. These two subtests stand apart 
from the other, more reliable, Wechsler subtests. The 
modest reliability of Object Assembly may reflect, 
in part, the small number of items as well as the role 
of chance factors in solving jigsaw puzzles. 
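
One way to see how a small number of items can limit reliability is the Spearman-Brown formula, a standard psychometric result that the passage above does not itself invoke. The sketch below is offered only as an illustration: the function name is ours, and the formula assumes that any added items would be parallel to the existing ones, which is unlikely to hold exactly for jigsaw puzzles.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Starting from the split-half reliability of .70 reported for Object Assembly,
# project the reliability of a hypothetical version with twice as many puzzles.
current = 0.70
doubled = spearman_brown(current, 2)
print(round(doubled, 2))
```

Under these assumptions, doubling the number of puzzles would raise the projected reliability from .70 to about .82, which suggests that test length alone could account for a good part of the gap between Object Assembly and the longer subtests.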


Coding/Digit Symbol 


Although the tasks are nearly identical, this subtest 
is called Coding on the WISC-III and Digit Symbol-Coding
on the WAIS-III. The WISC-III version
consists of two separate and distinct parts, one for 
examinees under age 8 (Coding A) and another 
for those 8 years of age and over (Coding B). In 
Coding A, the child must draw the correct symbol
inside a series of randomly sequenced shapes. The
task utilizes five shapes (star, circle, triangle, cross, 
and square), and each shape is assigned a unique 
symbol (vertical line, two horizontal lines, single 
horizontal line, circle, and two vertical lines, re- 
spectively). After a brief practice session, the child 
is told to draw the correct symbol inside 43 of the 
randomly sequenced shapes. However, since there 
is a two-minute time limit, high scores require rapid
performance. 
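
The shape-symbol key for Coding A can be written out as a lookup table. The pairings are those listed above; the function name is hypothetical, and counting correctly drawn symbols is an assumption about how a raw score might be tallied, not a statement of the scoring rules in the test manual.

```python
# Shape-to-symbol key for Coding A, as described in the text.
CODING_A_KEY = {
    "star": "vertical line",
    "circle": "two horizontal lines",
    "triangle": "single horizontal line",
    "cross": "circle",
    "square": "two vertical lines",
}


def count_correct(shapes, drawn_symbols):
    """Count the items on which the drawn symbol matches the key (assumed tally)."""
    return sum(
        1
        for shape, symbol in zip(shapes, drawn_symbols)
        if CODING_A_KEY[shape] == symbol
    )


# Three items attempted within the time limit, one of them coded wrongly.
shapes = ["star", "cross", "square"]
drawn = ["vertical line", "circle", "single horizontal line"]
print(count_correct(shapes, drawn))
```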

Coding B on the WISC-III and Digit Symbol- 
Coding on the WAIS-III are identical in format 
(Figure 6.5). For both subtests, the examinee must 
associate one symbol with each of the digits 0 
through 9 and quickly draw the appropriate symbol 
underneath a long series of random digits. The time 
limit for both versions is two minutes. Very few 
examinees manage to code all the stimuli in this 
amount of time. 

Estes (1974) analyzed the Digit Symbol sub- 
test from the standpoint of learning theory and 
concluded that efficient performance requires the 
ability to quickly produce distinctive verbal codes 
to represent each of the symbols in memory. For 
example, in Figure 6.5, the examinee might code 
the symbol underneath the number 2 as an “in- 
verted T.” Verbal coding mediates quick perfor- 
mance by simplifying a difficult task. Efficient 
performance also demands immediate learning of 
the digit-symbol pairings so that the examinee need 
not look from each digit to the reference table to 
determine the correct response. In this regard, Digit 
Symbol is unique: It is the only Wechsler subtest
that necessitates on-the-spot learning of an unfamiliar
task.

FIGURE 6.5 Digit Symbol Items Similar to Those
Found on the WAIS-III

Digit Symbol scores show a steep decrement 
with advancing age. In cross-sectional studies, raw 
scores on Digit Symbol decline by as much as 50 
percent from age 20 to age 70 (Wechsler, 1981). The 
decrement is approximately linear and not easily 
explained by superficial references to motivational 
differences or motor slowing. Of course, cross- 
sectional results are not necessarily synonymous 
with longitudinal trends. However, the age decre- 
ment on Digit Symbol is so steep that it must indi- 
cate, in part, a real age change in the speed of basic 
information-processing skills. Digit Symbol is one 
of the most sensitive subtests to the effects of organic 
impairment (Donders et al., 2001; Lezak, 1995). 


Mazes 


This subtest appears only on the WPPSI-R and 
WISC-III and consists of paper-and-pencil mazes 
that the child must solve within a time limit. The 
examinee is told not to lift the pencil and is coun- 
seled “try not to enter any blind alleys.” Full credit 
for each maze is given if the child solves it within 
the time limit (30 seconds to 150 seconds, depend- 
ing upon difficulty) without entering any blind al- 
leys. One raw score point is deducted for each blind 
alley entered. 
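
The scoring rule just described can be sketched as follows. The function name and the `max_points` parameter are hypothetical, since the credit available for each maze is not given here; awarding zero when the time limit is exceeded and never letting a score fall below zero are likewise assumptions.

```python
def score_maze(max_points, seconds_taken, time_limit, blind_alleys):
    """Score one maze: full credit minus one point per blind alley entered."""
    if seconds_taken > time_limit:
        return 0  # assumption: no credit once the time limit has passed
    return max(0, max_points - blind_alleys)  # assumption: floor at zero


print(score_maze(max_points=5, seconds_taken=40, time_limit=60, blind_alleys=0))
print(score_maze(max_points=5, seconds_taken=40, time_limit=60, blind_alleys=2))
print(score_maze(max_points=5, seconds_taken=70, time_limit=60, blind_alleys=0))
```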

Mazes taps perceptual-motor skills, motor 
speed, visual planning, and the ability to inhibit im- 
pulsive responding. This subtest is a poor measure 
of general intelligence, but measures perceptual 
organization reasonably well. On the WISC-III,
Mazes is a supplementary subtest not used in com- 
putation of the IQ. 


Symbol Search 


Symbol Search is a performance measure found on 
the WISC-III and the WAIS-III. This is a highly 
speeded subtest in which the examinee looks at a 
target group of symbols, then quickly examines a 
search group of symbols, and finally marks a “YES” 
or “NO” box to indicate whether one or more of the 
symbols in the target group occurred within the
search group. A Symbol Search item is depicted in
Figure 6.6. This subtest would appear to be a
measure of processing speed. Symbol Search is highly
sensitive to the impact of traumatic brain injury
(Donders et al., 2001).


FIGURE 6.6 Symbol Search Item Similar to Those
Found on the WISC-III

Note: The examinee’s task is to determine whether either shape at
the left occurs among the five shapes to the right.

WECHSLER ADULT
INTELLIGENCE SCALE-III


The WAIS-III is a significant revision of the WAIS-R,
even though many of the previous items were
retained. The most significant changes include the
addition of three subtests and the inclusion of an
alternative model for scoring the test (four index
scores to supplement the traditional approach of
Verbal, Performance, and Full Scale IQ). Other
important improvements over its predecessor include
updating and expanding the normative samples,
extending coverage to age 89, adding easy items to
improve the assessment of mental retardation, and
co-norming with the Wechsler Memory Scale-III
(Gregory, 1999). Because of changes in the WAIS-III
protocol forms (e.g., prominent display of
discontinue rules), this test is somewhat easier to
administer than the WAIS-R. Sattler and Ryan
(1999) provide an outstanding overview of the
WAIS-III in clinical practice.

The WAIS-III comprises 14 subtests, but
one (Object Assembly) is now optional and used
only as a substitute for spoiled subtests under rare
circumstances (Wechsler, 1997). From the main
body of 13 subtests, 11 are needed for computation
of the traditional IQs (Verbal, Performance, and
Full Scale). The IQ scores are normed to the
conventional average of 100 and standard deviation of
15 in the general population. The breakdown of
subtests for the IQ scores is as follows:


Verbal IQ 
Vocabulary 
Similarities 
Arithmetic 
Digit Span 
Information 
Comprehension 

Performance IQ 
Picture Completion 
Digit Symbol-Coding 
Block Design 
Matrix Reasoning 
Picture Arrangement 


All 11 subtests are used in the computation of Full 
Scale IQ. The Verbal-Performance breakdown of 
subtests on the WAIS-III is nearly identical to that 
found on the WAIS-R. The single difference is the 
addition of Matrix Reasoning in place of Object 
Assembly. 

In addition to the traditional IQ scores, the WAIS- 
III can be scored for four index scores, each based on 
2 or 3 of the 13 subtests. These are derived from fac- 
tor analysis of the subtests, which revealed four 
domains: Verbal Comprehension, Perceptual Organi- 
zation, Working Memory, and Processing Speed. The 
index scores are also based upon the familiar mean 
of 100 and standard deviation of 15. The breakdown 
of subtests for the four index scores is as follows: 


Verbal Comprehension Index 
Vocabulary 
Similarities 
Information 

Perceptual Organization Index 
Picture Completion 
Block Design 
Matrix Reasoning

Working Memory Index
Arithmetic
Digit Span
Letter-Number Sequencing

Processing Speed Index
Digit Symbol-Coding
Symbol Search


The reader will notice that the Verbal Comprehen- 
sion Index (VCI) is similar to Verbal IQ but does not 
include the subtests sensitive to attention (i.e., Digit 
Span and Arithmetic). For this reason, the VCI is a 
more direct measure of verbal comprehension than 
is Verbal IQ. The Perceptual Organization Index 
(POI) is similar to Performance IQ but is less de- 
pendent on speed (because Matrix Reasoning is un- 
timed). For this reason, the POI is a more refined 
measure of fluid reasoning and visual-spatial prob- 
lem solving than is Performance IQ. In these re- 
spects, VCI and POI are more “pure” measures than 
Verbal IQ and Performance IQ, respectively. 
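
The relationship between the traditional IQs and the index scores can be made concrete with a short sketch. The Python below does nothing more than encode the subtest lists given above as sets and compare them; the variable names are illustrative and are not part of any WAIS-III scoring software.

```python
# Subtest membership exactly as listed in the text.
# All variable names here are illustrative assumptions.
VERBAL_IQ = {"Vocabulary", "Similarities", "Arithmetic",
             "Digit Span", "Information", "Comprehension"}
PERFORMANCE_IQ = {"Picture Completion", "Digit Symbol-Coding", "Block Design",
                  "Matrix Reasoning", "Picture Arrangement"}

INDEXES = {
    "VCI": {"Vocabulary", "Similarities", "Information"},
    "POI": {"Picture Completion", "Block Design", "Matrix Reasoning"},
    "WMI": {"Arithmetic", "Digit Span", "Letter-Number Sequencing"},
    "PSI": {"Digit Symbol-Coding", "Symbol Search"},
}

# Subtests that count toward an IQ score but not toward its "purer" index.
verbal_extras = VERBAL_IQ - INDEXES["VCI"]
performance_extras = PERFORMANCE_IQ - INDEXES["POI"]

# Every subtest that feeds at least one IQ or index score.
all_subtests = VERBAL_IQ | PERFORMANCE_IQ | set().union(*INDEXES.values())

print(sorted(verbal_extras))       # ['Arithmetic', 'Comprehension', 'Digit Span']
print(sorted(performance_extras))  # ['Digit Symbol-Coding', 'Picture Arrangement']
print(len(all_subtests))           # 13
```

Read this way, the lists show that Verbal IQ adds Arithmetic, Digit Span, and Comprehension to the three VCI subtests, and that the 11 IQ subtests plus Letter-Number Sequencing and Symbol Search account for the main body of 13 subtests.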

The Working Memory Index (WMI) is com- 
prised of subtests sensitive to attention and imme- 
diate memory (Arithmetic, Digit Span, and Letter- 
Number Sequencing). A relatively low score on this 
index may signify that the examinee has an atten- 
tional or memory problem, especially with verbally 
presented materials. The Processing Speed Index 
(PSI) comprises subtests that require the highly 
speeded processing of visual information (Digit 
Symbol-Coding, Symbol Search). The PSI is 
sensitive to a wide variety of neurological and 
neuropsychological conditions (Tulsky, Zhu, & 
Ledbetter, 1997). 


WAIS-III Standardization 


The standardization of the WAIS-III was undertaken 
with great care and based on data gathered by the 
U.S. Bureau of the Census in 1995. The total sam- 
ple of 2,450 adults (ages 16 to 89) was carefully 
stratified on these variables: sex, race/ethnicity, ed- 
ucational level, and geographic region. Census fig- 
ures from 1995 were used as the target values for 
the stratification variables. For example, of persons 
in the 55- to 64-year-old range, the Census Bureau
found that 3.47 percent are African Americans with
a high school education. Hence, 3.5 percent of the
standardization subjects in this age range were
African Americans with a high school education.

The standardization sample was divided into 13 
age bands: 16-17, 18-19, 20-24, 25-29, 30-34, 
35-44, 45-54, 55-64, 65-69, 70-74, 75-79, 80-84, 
85-89. Except for the two oldest age groups, each 
sample included 200 participants carefully stratified 
on the demographic variables noted earlier; the 
80-84 age group included 150 participants, and the 
85-89 age group included 100 participants. The re- 
sulting sample bears a very close correspondence to 
the U.S. Census proportions. However, persons sus- 
pected of even mild cognitive impairment were ex- 
cluded, so the standardization sample is probably 
healthier than its Census counterparts. Specifically, 
several exclusionary criteria were used in the stan- 
dardization sample, including color blindness, un- 
corrected hearing or visual impairment, evidence of 
drug/alcohol problems, upper extremity impair- 
ment, use of antianxiety or antidepressant drugs, 
and a variety of potentially brain-impairing condi- 
tions (head injury, stroke, epilepsy, Alzheimer’s dis- 
ease, schizophrenia). 
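
The arithmetic behind this sampling plan is easy to check. The following sketch uses only the figures quoted above (13 age bands, 200 participants per band except the two oldest, and the 3.47 percent census value); the function names are hypothetical, and rounding to the nearest whole person is our assumption about how a quota would be set.

```python
# Age bands and band sizes as given in the text.
AGE_BANDS = ["16-17", "18-19", "20-24", "25-29", "30-34", "35-44", "45-54",
             "55-64", "65-69", "70-74", "75-79", "80-84", "85-89"]
SMALLER_BANDS = {"80-84": 150, "85-89": 100}


def band_size(band):
    """Number of participants in one age band (200 unless noted otherwise)."""
    return SMALLER_BANDS.get(band, 200)


def stratum_quota(band, census_percent):
    """Whole number of participants needed to mirror a census percentage.

    Rounding to the nearest person is an assumption made for illustration.
    """
    return round(band_size(band) * census_percent / 100)


total_n = sum(band_size(band) for band in AGE_BANDS)
quota = stratum_quota("55-64", 3.47)
achieved_percent = 100 * quota / band_size("55-64")

print(total_n, quota, achieved_percent)  # 2450 7 3.5
```

Seven participants out of 200 is exactly 3.5 percent, which is why the sample proportion quoted above differs slightly from the census target of 3.47 percent.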

Although the WAIS-III is similar to the WAIS- 
R and has a substantial item overlap, the two tests 
do not yield analogous IQs. In counterbalanced 
studies comparing scores of 192 adults on the two 
tests, WAIS-III scores are lower by 1 point for Ver- 
bal IQ, 5 points for Performance IQ, and 3 points 
for Full Scale IQ (Tulsky et al., 1997). In short, the 
WAIS-III is a harder test than the WAIS-R. There 
is a troubling enigma here: Why does the normative 
sample for the WAIS-III appear to be smarter than 
the normative sample for the WAIS-R? We take up 
this point in more detail in Topic 7B, Test Bias and 
Other Controversies. 


Reliability 


The reliability of the WAIS-III is exceptionally
good. Composite split-half reliabilities averaged
across all age groups are Verbal IQ, .97;
Performance IQ, .94; and Full Scale IQ, .98. Stability
coefficients on test-retest for 394 examinees confirm
much the same picture: Verbal IQ, .96;
Performance IQ, .91; and Full Scale IQ, .96. Reliabilities
and stability coefficients for the four index scores
tend to be slightly lower, but still at or near .90 in
all cases. Further supporting the reliability of the
WAIS-III, Zhu, Tulsky, Price, and Chen (2001)
reported recently that reliability estimates for several
clinical groups were even higher than those found
in the normative sample.

For Full Scale IQ, the standard error of
measurement is in the range of 2 to 2½ IQ points,
depending upon the age group. Consider what this
means: 95 percent of the time, an examinee’s true
Full Scale IQ will be within ±5 points (2 standard
errors of measurement) of the obtained value. In
common parlance, psychometrists would say that
WAIS-III IQ has about a 10-point band of error; that
is, IQ scores are accurate to within about ±5 points.
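
The link between reliability and the band of error can be sketched with the classical test theory formula SEM = SD × sqrt(1 − r). The code below is a minimal illustration rather than the procedure used in the WAIS-III manual: it plugs in the composite Full Scale reliability of .98 quoted above, whereas the manual's values vary by age group, and the function names are our own.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(obtained, sd, reliability, n_sem=2.0):
    """Band of +/- n_sem standard errors around an obtained score."""
    half_width = n_sem * standard_error_of_measurement(sd, reliability)
    return obtained - half_width, obtained + half_width


sem = standard_error_of_measurement(15, 0.98)
low, high = confidence_band(100, 15, 0.98)
print(round(sem, 2), round(low, 1), round(high, 1))  # 2.12 95.8 104.2
```

With a reliability of .98 the standard error is a little over 2 points, which falls inside the range of 2 to 2½ reported above; a standard error of 2½ points gives the ±5 figure exactly.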

In contrast to the strong reliabilities found for 
IQs and index scores, the reliabilities of the 14 in- 
dividual subtests are generally much weaker. The 
only subtests with stability coefficients in excess of 
.90 are Information (.94) and Vocabulary (.91). For 
the remaining subtests, reliability values range 
from the low .70s to the mid-.80s. The most im- 
portant implication of these weaker reliability find- 
ings is that examiners should approach subtest 
profile analysis with extreme caution. Subtest 
scores that appear discrepantly high (or low) for an 
individual examinee might be a consequence of the 
generally weak reliability of certain subtests rather 
than indicating true cognitive strengths or weak- 
nesses. Some reviewers conclude that profile analy- 
sis (the identification of specific cognitive strengths 
and weaknesses based upon analysis of peaks and 
valleys in the subtest scores) is not justified by the 
evidence (Gregory, 1994b). 
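
The caution about profile analysis follows from the standard error of the difference between two scores, which in classical test theory is SD × sqrt(2 − r1 − r2) when both scores share the same standard deviation. The sketch below rests on two assumptions that are not taken from the passage above: a subtest scaled-score metric with a standard deviation of 3, and illustrative reliabilities of .75 and .80 chosen from within the reported range of the low .70s to the mid-.80s.

```python
import math


def se_difference(sd, rel_a, rel_b):
    """Standard error of the difference between two scores on the same metric."""
    return sd * math.sqrt(2.0 - rel_a - rel_b)


def min_reliable_difference(sd, rel_a, rel_b, z=1.96):
    """Smallest difference that exceeds chance at roughly the 95 percent level."""
    return z * se_difference(sd, rel_a, rel_b)


# Assumed values for illustration only: SD = 3, reliabilities .75 and .80.
print(round(se_difference(3, 0.75, 0.80), 2))            # 2.01
print(round(min_reliable_difference(3, 0.75, 0.80), 2))  # 3.94
```

Under these assumptions two subtest scores would need to differ by about 4 scaled-score points before the gap could be distinguished from measurement error, which suggests why modest peaks and valleys in a profile are hard to interpret.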


Validity 


Based on a number of different lines of evidence re- 
viewed here, the validity of the WAIS-III appears to 
be quite satisfactory. Good criterion-related validity 
has been demonstrated in several studies correlating 
the WAIS-III with mainstream intelligence tests and 
also with measures of academic achievement. For
example, the WAIS-III Full Scale IQ correlates
strongly with global scores on other measures: .93 
with the WAIS-R, .88 with the WISC-III (for 16- 
year-olds in the overlapping age groups), .64 with 
the Standard Progressive Matrices, and .88 with the 
Stanford-Binet: Fourth Edition. The WAIS-III IQs 
also correlate strongly with the eight subtests from 
the Wechsler Individual Achievement Test, reveal- 
ing a median correlation of .70 (Tulsky et al., 1997). 
There is no doubt that the WAIS-III captures the 
same aspects of global intelligence measured by 
other widely used instruments. 

The validity of the WAIS-III is also buttressed
by its strong overlap with the WAIS-R and the 
WAIS, both of which rest upon an impressive array 
of validity data. For a full review of these findings 
the reader should consult Matarazzo (1972) and 
Kaufman (1990). We will present one representa- 
tive and provocative study here, a correlational 
analysis of academic standing and intelligence 
scores. Conry and Plant (1965) correlated WAIS 
scores with high school rank at graduation for 98 
students. They also correlated WAIS scores with 
grade point average at the end of the first year of 
college for 335 students in a second sample. The 
results are portrayed in Table 6.2. Notice that 
Verbal IQ predicts academic success just as well as 
Full Scale IQ, whereas Performance IQ bears a 
weaker relationship to achievement levels. Notice 
also that Vocabulary yields the highest overall cor- 
relation (.65) with academic standing in the entire 
table. This finding speaks forcefully in favor of
including vocabulary measures in intelligence 
tests. 

Several studies bolster the construct validity of 
the WAIS-III by showing that test scores in various 
groups are theory-consistent. For example, Sattler 
(1982, 1988) has pointed out that age trends on the 
WAIS-R subtests (which strongly resemble the 
WAIS-III subtests) conform closely to the Cattell- 
Horn theory of fluid and crystallized intelligence. 
The reader will recall from the previous unit in this 
chapter that fluid intelligence is used to solve novel 
problems, whereas crystallized intelligence re- 
quires the retrieval of learned or habitual responses. 
According to the theory, fluid intelligence declines
sharply in old age, whereas crystallized intelligence
remains constant or increases slightly (Horn,
1985).


TABLE 6.2 Correlations between High School
Rank, College Grades, and WAIS Scores

WAIS Subtests          High School    College
and IQs                (N = 98)       (N = 335)

Information            0.54           0.48
Comprehension          0.55           0.33
Arithmetic             0.45           0.19
Similarities           0.50           0.39
Digit Span             0.37           0.04
Vocabulary             0.65           0.46
Digit Symbol           0.34           0.15
Picture Completion     0.33           0.20
Block Design           0.29           0.19
Picture Arrangement    0.22           0.07
Object Assembly        0.17           0.12
Verbal IQ              0.63           0.47
Performance IQ         0.43           0.24
Full Scale IQ          0.62           0.44

Source: Adapted with permission from Conry, R., & Plant, W. T.
(1965). WAIS and group test predictions of an academic success
criterion: High school and college. Educational and Psychological
Measurement, 25, 493-500.

An analysis of the WAIS-R and the WAIS-III
indicates that the verbal subtests draw more
heavily upon crystallized intelligence (retrieving
learned responses), whereas the performance
subtests require high levels of fluid intelligence
(solving novel problems). Conforming to theoretical
expectations, an inspection of normative data
reveals that raw scores on the verbal subtests show
minimal decrement with advancing age, whereas
raw scores on the performance subtests drop off
precipitously for the older subjects (Wechsler,
1981, 1997). Of course, these data are
cross-sectional and therefore do not constitute definitive
proof of longitudinal decline. Nonetheless, the age
decrements on performance subtests are so striking
that it strains credulity to attribute them to cohort
differences or other artifacts. More likely, a
significant proportion of this decrement is a real
age-related decline that corroborates the Cattell-Horn
theory of intelligence.

Another theory-consistent expectation borne 
out by empirical findings is a strong relationship 
between educational attainment and IQ scores 
(Kaufman, 1990). These two variables should be 
highly correlated based on two assumptions, 
namely, that education boosts intelligence and that 
more-intelligent persons will generally seek a 
higher level of education. Apparently, analyses of 
the relationship between WAIS-III IQ scores and 
educational attainment have not yet been completed.
However, research on the previous edition
is relevant here because of the strong resemblance
between the two instruments. Matarazzo and Herman
(1984) analyzed the total years of schooling
against the Verbal IQ, Performance IQ, and Full 
Scale IQ of the 1,880 individuals used in the WAIS- 
R standardization sample. Excluding younger sub- 
jects aged 16 to 24, many of whom had not yet 
completed their education, the correlation between 
years of school completed and Full Scale IQ was 
.63 for the 500 subjects aged 25 to 44, and .62 for 
the 730 subjects aged 45 to 74. These findings re- 
veal a very strong correlation between educational 
attainment and IQ scores. Finally, Wechsler IQ and 
occupational attainment also are strongly linked 
(Reynolds, Chastain, Kaufman, & McLean, 1987), 
which further supports the validity of the WAIS-III 
as a measure of general intelligence. 


WECHSLER INTELLIGENCE SCALE
FOR CHILDREN-III


The WISC was published in 1949 as a downward
extension of the original Wechsler-Bellevue.
Although the test was used widely in the next two
decades, psychometricians perceived a number of
flaws in the WISC: absence of nonwhites in the
standardization sample, ambiguities of scoring,
inappropriate items for children (e.g., reference to
“cigars”), and absence of females and African
Americans in the pictorial content of items. The
WISC-R (Wechsler, 1974) and the WISC-III
(Wechsler, 1991) corrected these flaws.





The WISC-III consists of 10 subtests and 3 sup- 
plementary subtests. The verbal and performance 
subtests are administered in alternating order: 


Verbal Subtests        Performance Subtests

Information            Picture Completion
Similarities           Coding
Arithmetic             Picture Arrangement
Vocabulary             Block Design
Comprehension          Object Assembly
Digit Span             Symbol Search
                       Mazes


Digit Span, Symbol Search, and Mazes are sup- 
plementary subtests not normally included in the 
computation of IQ. However, these subtests are 
usually administered because of the diagnostic in- 
formation they provide. In the event that a subtest 
is disrupted during administration and therefore 
spoiled, or must be omitted because of special dis- 
abilities, Digit Span (for verbal subtests) or Mazes 
(for performance subtests) may be substituted. 
Symbol Search can be used as a substitute only for 
the Coding subtest. 

The standardization of the WISC-III is
exceptionally good, based on 100 boys and 100 girls at
each year of age from 6½ through 16½ (total N =
2,200). These cases were carefully selected and
stratified on the basis of the 1988 U.S. Census with
respect to race/ethnicity (White, African American,
Native American, Eskimo, Aleut, Asian, Pacific
Islander, and Other), geographic region, and parent
education. Although not a formal stratification
variable, community size for the WISC-III
standardization sample resembled census data closely. The
standardization sample was drawn from both public
and private schools, including children in special
service programs. A desirable feature of the
sample is that 7 percent of the children were
categorized as having learning disabilities, emotional
disturbances, speech/language impairments, and
so forth, and 5 percent of the sample consisted of
children in programs for gifted and talented persons.
The reliability of the WISC-III is comparable
to that of the WAIS-R: The three IQ scores (Verbal,
Performance, and Full Scale) show split-half and
test-retest reliabilities in the .90s, whereas the
individual subtests possess somewhat lower split-half
coefficients, ranging from .69 (Object Assembly)
to .87 (Vocabulary and Block Design). Test-retest
reliabilities tend to be slightly lower.

The validity of the WISC-III rests, in part, upon 
its overlap with the WISC-R, for which dozens of 
supportive studies could be cited. We do not want
to overwhelm with excessive detail, so we refer the
interested reader to Sattler (1988) for a good review
of earlier studies. The WISC-III manual cites an
impressive array of validity studies, which we
summarize here. The preliminary studies indicate
strong correlations with WISC-R scores (r = .90 for
VIQ, .81 for PIQ, and .89 for FSIQ), strong corre- 
lations with WAIS-R scores for 16-year-olds (r = 
.90 for VIQ, .80 for PIQ, and .86 for FSIQ), and 
slightly lower but confirmatory correlations with 
WPPSI-R scores for a sample of 6-year-old chil- 
dren. These correlations are virtually as high as the 
reliabilities of the respective scales would allow. An 
interesting finding, discussed in the next chapter, is 
that WISC-III IQs are an average of about 5 points 
lower than WISC-R IQs (Vance, Maddux, Fuller, 
& Awadh, 1996). 
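
The remark that these correlations are about as high as the reliabilities allow reflects a standard result in classical test theory: the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. The sketch below illustrates the ceiling and the related correction for attenuation. The reliabilities of .94 and .95 are hypothetical values chosen from "the .90s" and are not figures from either test manual.

```python
import math


def max_observable_r(rel_x, rel_y):
    """Upper bound on an observed correlation given the two reliabilities."""
    return math.sqrt(rel_x * rel_y)


def disattenuated_r(r_observed, rel_x, rel_y):
    """Correction for attenuation: estimated correlation between true scores."""
    return r_observed / math.sqrt(rel_x * rel_y)


# Hypothetical reliabilities in the .90s, for illustration only.
ceiling = max_observable_r(0.94, 0.95)
corrected = disattenuated_r(0.90, 0.94, 0.95)
print(round(ceiling, 3), round(corrected, 3))
```

Under these assumed reliabilities, an observed correlation of .90 sits close to the ceiling, and the corrected value approaches the mid-.90s.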

The WISC-III also shows theory-confirming
correlations with numerous cognitive, ability, and
achievement tests (Wechsler, 1991). For example,
in a study of 27 children ages 7 to 14 years who
were administered both the WISC-III and the
Differential Ability Scales (DAS; Elliott, 1990),
WISC-III VIQ scores correlated .87 with Verbal
Ability scores from the DAS, but only .58 with
Nonverbal Reasoning scores from the DAS.
Conversely, WISC-III PIQ scores correlated .78 with
Nonverbal Reasoning scores, but only .31 with
Verbal Ability. The manual cites theory-consistent
correlations (appropriately high for similar
constructs, low for dissimilar constructs) with the
Otis-Lennon School Ability Test, the Benton
Revised Visual Retention Test, subtests from the
Halstead-Reitan neuropsychological test battery, and
the Wide Range Achievement Test-Revised. Studies
with special groups of children (those who are
gifted, retarded, learning disabled, hyperactive,
conduct disordered, speech/language delayed)
also provide strong support for WISC-III validity.

Factor-analytic studies of the standardization
sample provided additional evidence for the utility
of the WISC-III in diagnostic assessment of
children. The results of numerous factor analyses,
including separate analyses for four age-group
subsamples (ages 6-7, 8-10, 11-13, and 14-16),
strongly indicated a four-factor solution. The first
two factors are familiar from previous studies of the
Wechsler scales:


Verbal Comprehension    Perceptual Organization

Information             Picture Completion
Similarities            Picture Arrangement
Vocabulary              Block Design
Comprehension           Object Assembly


The third factor on the WISC-III is slightly differ- 
ent from the third factor on its predecessor, whereas 
the fourth factor is a new one: 


Freedom from Distractibility    Processing Speed

Arithmetic                      Coding
Digit Span                      Symbol Search


One subtest, Mazes, revealed an inconsistent assign- 
ment to factors, loading weakly on Freedom from 
Distractibility for 6- to 7-year-olds and weakly on 
Perceptual Organization for children 8 years and 
older. 

The four-factor solution to the WISC-III provides
for the optional reporting of index scores
(similar to IQ scores) for each of the four factors. 
These scores are based upon the familiar mean of 
100 and standard deviation of 15. The index scores 
(Verbal Comprehension Index or VCI, Perceptual 
Organization Index or POI, etc.) are derived from 
the allocation of subtests previously listed and 
serve to supplement Verbal IQ and Performance IQ. 
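
A standard score on the metric of mean 100 and standard deviation 15 is often easier to interpret as a percentile rank. The sketch below makes that conversion under the assumption that scores follow a normal distribution; the text states only the mean and standard deviation, so normality is our simplification, and the function name is hypothetical.

```python
import math


def percentile_rank(score, mean=100.0, sd=15.0):
    """Percent of a normal population scoring at or below `score`.

    Assumes normally distributed standard scores (an assumption, not a
    claim about any particular test's norms).
    """
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


for score in (70, 85, 100, 115, 130):
    print(score, round(percentile_rank(score), 1))
```

Under this assumption, an index score of 115 falls near the 84th percentile and a score of 70 near the 2nd.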

Even though four factors have emerged
statistically in several factor-analytic studies of the
WISC-III (e.g., Tupa, Wright, & Fristad, 1997; Roid &
Worrall, 1997; Konold, Kush, & Canivez, 1997), the
clinical validity of the third factor, Freedom from
Distractibility (FFD), has been called into question.
For example, when 200 children with
attention-deficit hyperactivity disorder (6 to 11 years of age)
completed the WISC-III and several other measures,
FFD scores were not correlated with a measure of
sustained visual attention. In fact, the majority of the
children did not even show a significant relative
weakness on the FFD index (Reinecke, Beebe, &
Stein, 1999). In a factor-analytic study of 1,201
students with learning disabilities, Watkins and Kush
(2002) concluded: “Current results add to a growing
body of evidence suggesting that WISC-III Verbal
Comprehension, Perceptual Organization, and
Processing Speed factors are robust across samples but
the Freedom from Distractibility factor demonstrates
tenuous construct validity” (p. 4). Riccio, Cohen,
Hall, and Ross (1997) also found a minimal
relationship between FFD scores and independent
measures of attention in a sample of children with
presumed attentional problems. In sum, it is unclear
what the FFD factor measures or whether it
measures a consistent construct at all.





STANFORD-BINET INTELLIGENCE
SCALES: FIFTH EDITION


With a lineage that goes back to the Binet-Simon
scale of 1905, the Stanford-Binet: Fifth Edition
(SB5) has the oldest and perhaps the most prestigious
pedigree of any individual intelligence test. In Table
6.3, we outline some important milestones in the
development of the SB5 and its predecessors. Released
in 2003, the SB5 is a very new test (Roid, 2002,
2003). For this reason, evaluation of this instrument
is based, in part, upon its resemblance in content and
subtests to the SB4, about which a large body of
independent research literature has been amassed.


The SB5 Model of Intelligence 


In early editions of the Stanford-Binet, the exam- 
iner obtained only a composite IQ. Although the 
pattern of right and wrong answers could be ana- 
lyzed qualitatively, the earlier Stanford-Binet tests 
(prior to the fourth edition) did not provide a basis 
for quantitative analysis of the subcomponents of 
the entire scale. The fourth and fifth editions
corrected this shortcoming.


TABLE 6.3 Milestones in the Development of the Stanford-Binet and
Predecessor Tests

Year   Test/Authors                     Comment

1905   Binet and Simon                  Simple 30-item test
1908   Binet and Simon                  Introduced the mental age concept
1911   Binet and Simon                  Expanded to include adults
1916   Stanford-Binet                   Introduced the IQ concept
       Terman and Merrill
1937   Stanford-Binet-2                 First use of parallel forms (L and M)
       Terman and Merrill
1960   Stanford-Binet-3                 Modern item-analysis methods used
       Terman and Merrill
1972   Stanford-Binet-3                 SB-3 restandardized on 2,100 persons
       Thorndike
1986   Stanford-Binet-4                 Complete restructuring into 15 subtests
       Thorndike, Hagen, and Sattler
2003   Stanford-Binet-5                 Five factors of intelligence
       Roid





The organization of the SB5 was guided by the 
principle that each of five factors of intelligence can 
be assessed in two distinct domains—nonverbal and 
verbal. The five factors—derived from modern cog- 
nitive theories such as Carroll (1993) and Baddeley 
(1986)—are fluid reasoning, knowledge, quantita- 
tive reasoning, visual-spatial processing, and work- 
ing memory. When these five factors of intelligence 
are “crossed” with the two domains (nonverbal and 
verbal), the result is an instrument with ten subtests 
(Figure 6.7). Thus, the SB5 provides a number of 
different perspectives on the cognitive functioning 
of an examinee: ten subtest scores (mean of 10, SD 
of 3), three IQ scores (the familiar Full Scale IQ, 
Verbal IQ, and Nonverbal IQ), as well as five factor 
scores (Fluid Reasoning, Knowledge, Quantitative 
Reasoning, Visual-Spatial Processing, and Working 
Memory). The IQ and factor scores are normed to a 
mean of 100 and SD of 15. 
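
The crossed design described above can be enumerated in a few lines. The sketch below only lays out the structure named in the text (five factors by two domains); the subtest labels are formed by joining the domain and factor names, and nothing here represents how the SB5 actually converts subtest scores into IQ or factor scores.

```python
from itertools import product

FACTORS = ["Fluid Reasoning", "Knowledge", "Quantitative Reasoning",
           "Visual-Spatial Processing", "Working Memory"]
DOMAINS = ["Nonverbal", "Verbal"]

# Each subtest is one (factor, domain) cell of the crossed design.
CELLS = list(product(FACTORS, DOMAINS))


def subtests_for_domain(domain):
    """Subtests grouped under one domain (e.g., those behind Nonverbal IQ)."""
    return [f"{d} {f}" for f, d in CELLS if d == domain]


def subtests_for_factor(factor):
    """The nonverbal and verbal subtests that share one factor."""
    return [f"{d} {f}" for f, d in CELLS if f == factor]


print(len(CELLS))                             # 10
print(subtests_for_factor("Working Memory"))  # ['Nonverbal Working Memory', 'Verbal Working Memory']
print(len(subtests_for_domain("Verbal")))     # 5
```

Each factor score therefore rests on two subtests and each domain on five, while the full set of ten corresponds to the Full Scale IQ.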


Routing Procedure and Tailored Testing 


The SB5 maintains the historical tradition of this
instrument by using a routing procedure to
estimate the general cognitive ability of the examinee
before proceeding to the remainder of the test. The
purpose of the routing procedure is to identify the
appropriate starting points for subsequent subtests.
The routing items are both nonverbal (object series
and matrices) and verbal (vocabulary). These items
also provide the Abbreviated IQ, sometimes used
for screening purposes. Roid (2002) describes the
advantages of using a routing procedure:


This tailored approach to assessment provides 
greater richness of factor measurement within a 
shorter, efficient test administration. The use of 
modern item response theory in the design of SB5 
allows for greater precision of measurement due to 
the adaption of the test to the functional level of the 
examinee in an efficient time frame. 


Thus, the purpose of the routing procedure is not
just to reduce the number of items administered 
(and therefore save time), but to do so without loss 
of measurement precision. This is possible because 
the SB5 was constructed according to the princi- 
ples of item response theory (Embretson, 1996). 
When a test is constructed within the framework of 
item response theory, item difficulty levels and 
other parameters are precisely calibrated during the 
development phase. 
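
Why calibrated item difficulties permit shorter testing without loss of precision can be illustrated with a simple item response model. The sketch below assumes a one-parameter logistic (Rasch-type) model, which the text does not specify for the SB5; the item difficulties and the ability estimate are invented for illustration, and the function names are hypothetical.

```python
import math


def p_correct(theta, difficulty):
    """Probability of a correct response under a one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta, difficulty):
    """Information an item provides at ability theta: P * (1 - P)."""
    p = p_correct(theta, difficulty)
    return p * (1.0 - p)


def pick_start_item(theta_estimate, difficulties):
    """Choose the calibrated item that is most informative at the estimate."""
    return max(difficulties, key=lambda b: item_information(theta_estimate, b))


# Hypothetical calibrated difficulties and a routing-based ability estimate.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(pick_start_item(0.8, difficulties))  # 1.0
print(item_information(0.0, 0.0))          # 0.25
```

In this model an item is most informative when its difficulty matches the examinee's ability, so items far too easy or far too hard add little precision and can be skipped once a routing estimate is available.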






FIGURE 6.7 Structure of the Stanford-Binet: Fifth Edition

[Figure: a grid crossing the FACTORS (Fluid Reasoning, Knowledge,
Quantitative Reasoning, Visual-Spatial Processing, Working Memory)
with the DOMAINS (Nonverbal, Verbal), giving ten subtests; Nonverbal IQ
and Verbal IQ appear beneath their respective columns.]


Special Features of the SB5 


In addition to providing a more familiar partition
of intelligence into Full Scale IQ, Verbal IQ, and
Nonverbal IQ, the SB5 also features a number of
other improvements over its predecessor, the SB4.
The test now includes extensive high-end items,
designed to assess the highest level of gifted
performance. Many of these items are updates from very
early editions of the Stanford-Binet, when the
instrument was renowned for its very high ceiling. At
the other extreme, improved low-end items provide
better assessment for very young children (as
young as age 2) and adults with mental retardation.
In addition, the items and subtests that contribute to
the Nonverbal IQ do not require expressive
language, which makes this part of the test ideal for
assessing individuals with limited English, deafness,
or communication disorders. The developers of the
SB5 also screened test items for fairness based on
religious as well as traditional concerns. Expert
panels examined the entire test on fairness
issues related to the standard variables (gender, race,
ethnicity, and disability) and religious tradition
(Christian, Jewish, Muslim, Hindu, and Buddhist
backgrounds). This is the first time in the history of
intelligence testing that religious tradition has been
considered in test development. Finally, the
Working Memory factor, consisting of both verbal and
nonverbal subtests, shows promise in helping to
assess and understand children with
attention-deficit/hyperactivity disorder.


Standardization and Psychometric 
Properties of the SB5 


The SB5 is suitable for children age 2 through
adults age 85 and older, and the standardization 
sample consists of 4,800 individuals stratified by 
gender, ethnic, regional, and educational levels in 
the United States, based on the year 2000 census. 
In part because item selection was determined by 
modern item response theory, the reliability of sub- 
tests, indices, and IQ scores is very strong and com- 
parable to other mainstream individual intelligence 
tests. For example, the Verbal IQ, Nonverbal IQ, 
and Full Scale IQ each have reliabilities in the .90s, 
and the individual subtests are in the range of .70 
to .85 (Roid, 2002).

As is typical in the release of a new test, the
manual for the SB5 (Roid, 2003) reports on
numerous affirming correlational studies (e.g., with the
Wechsler scales, the SB4, the UNIT) that provide
strong support for criterion-related validity. The
validity of the test as a measure of general intelligence
is also supported by its resemblance to the SB4,
about which a large body of research can be cited.
For example, Lamp and Krohn (2001) studied the
longitudinal predictive validity of the SB4 in a
sample of 89 Head Start children (39 African American
and 50 white) from impoverished backgrounds who
ranged in age from about 4 to 6½. These children
were retested several times over an 8-year period on
both the SB4 and the Metropolitan Achievement
Test. The correlations between the initial SB4 score
and the subsequent achievement scores were very
strong (mainly in the .50s), and the test was equally
good at predicting outcome for African American
and white children. In another study (Atkinson,
Bevc, Dickens, & Blackwell, 1992), the concurrent
validity of the SB4 was tested against the Leiter
International Performance Scale and the Vineland
Adaptive Behavior Scales in a sample of 24 children
with developmental delays. The correlations were
very robust (.78 and .70, respectively). These and
many other studies strongly support the validity of
the SB4 as a measure of general intelligence. As
new research is reported on the SB5, it is likely that
this recent edition also will prove to be highly valid
and even more useful than its predecessor as a
measure of intelligence.

In summary, the SB5 is a very promising new test
that is especially useful at both ends of the cognitive 
spectrum—the very young or those with develop- 
mental delays, and very gifted persons. Based upon 
the care with which the instrument was constructed, 
the test is likely to become a mainstay of individual 
intelligence testing in a wide variety of settings. 


DETROIT TESTS OF
LEARNING APTITUDE-4


The Detroit Tests of Learning Aptitude-4 (DTLA-4;
Hammill, 1999) is a recent revision of an instrument
first published in 1935. The test is individually
administered and designed for schoolchildren
from 6 through 17 years of age. The DTLA-4 consists
of 10 subtests that form the basis for computing
16 composites, including general intelligence,
optimal level, and 14 ability areas. The subtests
are largely within the Binet-Wechsler tradition,
although there are a few surprises such as the inclusion
of Story Construction, a measure of storytelling
ability (Table 6.4).

The General Mental Ability composite is 
formed by combining standard scores for all 10 
subtests in the battery. The Optimal Level compos- 
ite is based upon the highest 4 standard scores 
earned by the subject and is thought to represent 
how well the examinee might perform under opti- 
mal circumstances. Each of the remaining 14 com- 
posite scores is derived from a combination of 
several subtests thought to measure a common at- 
tribute. For example, subtests that involve knowl- 
edge of words and their use are combined to form 
the Verbal Composite, whereas subtests that do not 


TABLE 6.4 Brief Description of the DTLA-4 Subtests

Subtest               Task

Word Opposites        Provide antonyms (word opposites).
Design Sequences      Discriminate and remember nonsensical
                      graphic material.
Sentence Imitation    Repeat orally presented sentences.
Reversed Letters      Short-term visual memory and attention.
Story Construction    Create a logical story from several
                      pictures.
Design Reproduction   Copy designs from memory.
Basic Information     Knowledge of everyday facts and
                      information.
Symbolic Relations    Select from a series of designs the part
                      that was missing from a previous design.
Word Sequences        Repeat a series of unrelated words.
Story Sequences       Organize pictorial material into
                      meaningful sequences.

involve reading, writing, or speech comprise the
Nonverbal Composite. Several of the composite
scores are designed to represent major constructs
within contemporary theories of intelligence. In addition
to the General Mental Ability composite and
the Optimal Level composite, the remaining 14
DTLA-4 composite scores are as follows:


Verbal                Nonverbal            (Linguistic)
Attention-enhanced    Attention-reduced    (Attentional)
Motor-enhanced        Motor-reduced        (Motoric)
Fluid                 Crystallized         (Horn & Cattell)
Simultaneous          Successive           (Das)
Associative           Cognitive            (Jensen)
Verbal                Performance          (Wechsler)


The 16 composite scores are based upon the famil- 
iar mean of 100 and standard deviation of 15. The 
10 subtests are normed for a mean of 10 and stan- 
dard deviation of 3. 
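
The two score metrics just described are linearly related through z scores. The Python sketch below is illustrative only: the subtest scores are hypothetical, the function names are our own, and actual DTLA-4 composites come from the publisher's norm tables rather than from this arithmetic. It shows how a subtest standard score can be re-expressed on the mean-100 metric and how the four highest subtest scores (the inputs to the Optimal Level composite) might be picked out.

```python
# Metrics stated in the text: subtests (mean 10, SD 3), composites (mean 100, SD 15).
SUBTEST_MEAN, SUBTEST_SD = 10, 3
COMPOSITE_MEAN, COMPOSITE_SD = 100, 15


def to_composite_metric(subtest_score):
    """Re-express a subtest standard score on the mean-100, SD-15 metric."""
    z = (subtest_score - SUBTEST_MEAN) / SUBTEST_SD
    return COMPOSITE_MEAN + COMPOSITE_SD * z


def highest_four(subtest_scores):
    """Return the four highest (subtest, score) pairs, highest first."""
    ranked = sorted(subtest_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:4]


# Hypothetical profile for one examinee (not data from the DTLA-4 manual).
example_scores = {
    "Word Opposites": 12,
    "Design Sequences": 9,
    "Sentence Imitation": 11,
    "Reversed Letters": 8,
    "Story Construction": 14,
    "Design Reproduction": 10,
    "Basic Information": 13,
    "Symbolic Relations": 7,
    "Word Sequences": 9,
    "Story Sequences": 10,
}

for name, score in highest_four(example_scores):
    print(name, score, to_composite_metric(score))
```

A subtest score of 13 lies one standard deviation above the mean, so it corresponds to 115 on the composite metric. The conversion says nothing about how a real composite is normed; it only makes the two scales comparable.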

The composites were designed to offer con- 
trasting assessments such that a difference between 
scores may be of diagnostic significance. For ex- 
ample, an examinee who scored well on Attention- 
Reduced aptitude but poorly on Attention-Enhanced 
aptitude (in the Attentional domain) presumably ex- 
periences difficulty with immediate recall, short- 
term memory, or focused concentration. 

The DTLA-4 was standardized on 1,350 stu- 
dents whose backgrounds closely matched census 
data for sex, race, urban/rural residence, family in- 
come, educational attainment of parents, and geo- 
graphic area. The reliability of this instrument is 
similar to other individual tests of intelligence, with 
internal consistency coefficients generally exceed- 
ing .80 for the subtests and .90 for the composites, 
and test-retest coefficients for the subtests and the 
composites in the .80s and .90s. Criterion-related 
validity is well established through correlational 
studies with other mainstream instruments such as 
the WISC-III, K-ABC, and Woodcock-Johnson. 

A concern with the DTLA-4 is that the conceptual
breakdown into composites is not sufficiently
supported by empirical evidence. For
example, while it may be true that the Simultaneous
composite does measure the simultaneous cognitive
processes proposed by Das, Kirby, and Jarman
(1979), there is scant empirical support to
buttress this claim. Another problem with this instrument
is that there are more composites than
there are subtests! Inevitably, the composites will
be highly intercorrelated, because each subtest occurs
in several composites. In sum, DTLA-4 may
be a good measure of general intelligence, but the
use of composite scores for purposes of psychoeducational
planning requires additional empirical
study. Smith (2001) and Traub (2001) provide
thorough reviews of the DTLA-4.
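
The point about intercorrelated composites can be made concrete with a small calculation. The sketch below rests on simplifying assumptions that are ours, not the test manual's: each composite is an unweighted sum of subtests, and the subtests are mutually uncorrelated with equal variance. Under those assumptions the correlation between two composites is the number of shared subtests divided by the square root of the product of the composite sizes. The subtest labels are placeholders, not actual DTLA-4 composite assignments.

```python
from math import sqrt


def overlap_correlation(composite_a, composite_b):
    """Correlation between two unweighted sum composites, assuming the
    underlying subtests are uncorrelated and have equal variance."""
    shared = len(composite_a & composite_b)
    return shared / sqrt(len(composite_a) * len(composite_b))


# Hypothetical composites built from five subtests each, three of them shared.
composite_x = {"s1", "s2", "s3", "s4", "s5"}
composite_y = {"s3", "s4", "s5", "s6", "s7"}

print(overlap_correlation(composite_x, composite_y))
```

Even with uncorrelated subtests, sharing three of five subtests produces a correlation of .60 from overlap alone. Real subtests are positively correlated with one another, so observed composite correlations would be higher still.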


KAUFMAN BRIEF INTELLIGENCE 
TEST (K-BIT) 


The individual intelligence tests previously dis- 
cussed in this and the preceding topic are excellent 
measures of intellectual ability, but they are not 
without their drawbacks. One problem is the time 
required to administer them. Testing sessions with 
the Wechsler scales, Kaufman Assessment Battery 
for Children, and the Stanford-Binet easily can last 
one hour, and two hours is not unusual if the ex- 
aminee is bright and highly verbal. A second dis- 
advantage to these mainstream tests is the amount 
of training required to administer them. Proper ad- 
ministration of most individual intelligence tests is 
based upon the assumption that the examiner has 
an advanced degree in psychology or a related field 
and has received extensive supervised experience 
with the instruments in question. 

Alan Kaufman responded to the need for a 
brief, easily administered screening measure of in- 
telligence by developing the Kaufman Brief Intel- 
ligence Test (K-BIT; Kaufman & Kaufman, 1990; 
Kaufman & Wang, 1992). The K-BIT consists of 
a Vocabulary section and a Matrices section. The 
Vocabulary test contains two parts: Expressive Vo- 
cabulary (naming pictures) and Definitions (pro- 
viding a word based upon a brief phrase and a 
partial spelling). The Matrices test requires solving 
2 × 2 and 3 × 3 analogies using figural stimuli.

The K-BIT is normed for subjects ages 4 to 90 
years and can be administered in 15 to 30 minutes.
The test yields standard scores with mean of 100
and SD of 15 for Vocabulary, Matrices, and the 
combination of the two, called the IQ Composite. 
In spite of the comparability of these scoring di- 
mensions with well-known intelligence tests, the 
K-BIT authors make it clear that their instrument is 
not intended as a substitute for traditional ap- 
proaches (e.g., WPPSI-R, K-ABC, WISC-III, or 
SB:FE). The K-BIT is mainly a screening test use- 
ful in signalling the need for more extensive as- 
sessment. The brevity of this instrument also makes 
it a natural choice for research on intelligence. 

Reliability findings for the K-BIT are excep- 
tionally strong. Split-half reliability and test-retest 
coefficients for a variety of samples were in the .90s 
for Vocabulary, the .80s and .90s for Matrices, and 
.90s for IQ Composite. The normative sample of 
2,022 individuals was within 1 to 3 percentage 
points of the 1990 U.S. Census figures for sex, ge- 
ographic region, race or ethnic group, and educa- 
tional attainment of the parents (for subjects 4 to 19 
years of age) or examinees themselves (20 years of 
age and above). 
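
Split-half coefficients are conventionally stepped up to full-test length with the Spearman-Brown formula. The text does not say how the K-BIT manual computed its values, so the sketch below is only a general illustration of that standard correction; the half-test correlation of .85 is a made-up input.

```python
def spearman_brown(half_test_r):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * half_test_r / (1 + half_test_r)


# Hypothetical correlation between odd-item and even-item half scores.
print(round(spearman_brown(0.85), 3))
```

The correction always raises a positive half-test correlation, which reflects the general principle that longer tests are more reliable than shorter ones of comparable item quality.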

The K-BIT manual reports highly supportive va- 
lidity data from 20 correlational studies. These re- 
sults are similar to a recent concurrent validity study 
that compared K-BIT results and WAIS-R scores for 
200 referrals to a neuropsychological assessment 
center (Naugle, Chelune, & Tucker, 1993). The pa- 
tient sample included persons with seizure disor- 
ders, head injuries, substance abuse, psychiatric 
disturbance, stroke, dementia, and other neurologi- 
cal conditions. The heterogeneity of the referral 
sample guaranteed a wide range of functional abil- 
ity, a desirable feature in a validation study. Al- 
though the K-BIT scores tended to be about 5 points 
higher than their WAIS-R counterparts, the correlations
between these two instruments were extremely
high and theory-confirming. Vocabulary IQ (K-BIT) 
and Verbal IQ (WAIS-R) correlated .83; Matrices IQ 
(K-BIT) and Performance IQ (WAIS-R) correlated 
.77; and overall IQs from the two instruments correlated
an amazing .88. In a study comparing the K-BIT
and the WISC-III scores for 50 referred
students, Prewett (1995) also reported strong corre- 
lations (r = .78 for overall scores) and also discov- 
ered that the K-BIT scores tended to be about 5 
points higher than their WISC-III counterparts. In a
sample of 65 children with reading disability, Chin, 
Ledesma, Cirino, and others (2001) also found that 
the K-BIT overestimated WISC-III IQs by 1.2 to 5.0 
points, on average. However, their study also 
showed that, in individual cases, K-BIT scores can 
underestimate or overestimate WISC-III scores by 
as much as 25 points, reaffirming that the K-BIT is 
not appropriate for placement and diagnostic pur- 
poses. Canivez (1995) found comparable scores 
between the K-BIT and the WISC-III for 137 ele- 
mentary- and middle-school children and also re- 
ported very strong correlations between the two 
tests, especially for overall scores (r = .87). Eisen- 
stein and Engelhart (1997) found that the K-BIT 
performed well in estimating IQs in adult neuro- 
psychology referrals, but Donders (1995) recom- 
mends caution when using the test with brain- 
injured children. The reason for caution is that 
K-BIT scores show a negligible relationship with 
length of coma; that is, the test is not a good index
of neuropsychological status in children. Even so, 
the K-BIT is an outstanding screening measure of 
general intelligence for use in research or when time 
constraints preclude use of a longer measure. 
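
The studies reviewed above show that a brief test can correlate very highly with a longer test while still running several points higher on average and missing by a wider margin in individual cases. The Python sketch below illustrates why these findings are compatible. The paired scores are invented for the illustration and are not data from any of the cited studies; the function name is our own.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired IQs: a long-form test and a brief screening test.
long_form = [85, 92, 100, 108, 115, 78, 96, 104]
brief_form = [93, 93, 107, 112, 124, 80, 99, 110]

differences = [b - a for a, b in zip(long_form, brief_form)]
mean_difference = sum(differences) / len(differences)
largest_gap = max(abs(d) for d in differences)

print(round(pearson_r(long_form, brief_form), 2), mean_difference, largest_gap)
```

Correlation is unaffected by a constant shift in one set of scores, so a strong r says nothing about whether the two tests agree in level. The mean difference and the largest individual gap have to be inspected separately, which is why a screening test is a poor basis for placement decisions.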


SUMMARY 


1. For purposes of estimating general intelli- 
gence, any well-normed mainstream instrument 
will suffice. However, for purposes of individual- 
ized assessment, examiners need to consider the 
particular strengths and weaknesses of potential 
instruments. 


2. David Wechsler was a pragmatist who bor- 
rowed heavily from the Army Alpha and Beta tests 
in fashioning many of the subtests from the various 
Wechsler scales. For each of his intelligence tests, 
Wechsler used 10 to 13 subtests—about half verbal 
and half performance. 


3. The first Wechsler test was the Wechsler-Bellevue,
published in 1939 and updated in 1946.
Other tests and their dates of revision are Wechsler 
Preschool and Primary Scale of Intelligence- 
Revised (1967, 1989); Wechsler Intelligence Scale 
for Children-III (1949, 1974, 1991); and Wechsler 
Adult Intelligence Scale-III (1955, 1981, 1997). 


4. All of the Wechsler scales use the same format:
10 to 13 subtests about equally divided between
verbal and performance emphasis; a
common metric for IQ, with mean of 100 and stan- 
dard deviation of 15; a common core of subtests, so 
that examiners can easily transfer their test-giving 
skills from one Wechsler scale to another. 


5. The Wechsler Adult Intelligence Scale-III 
(WAIS-III) is the most widely used individual test
of adult intelligence. The test has excellent relia- 
bility, well-established validity, and adds a new 
subtest: Matrix Reasoning. 


6. Factor analysis of the Wechsler Intelligence
Scale for Children-III (WISC-III, for children
ages 6 to 16½) often yields a four-factor
solution: Verbal Comprehension (mainly verbal 
subtests), Perceptual Organization (mainly perfor- 
mance subtests), Freedom from Distractibility 
(Arithmetic and Digit Span), and Processing Speed 
(Coding and Symbol Search). 

7. The newly released Stanford-Binet: Fifth 
Edition (SB5) features the partition of intelligence 
into five factors and two domains (verbal and non- 
verbal), resulting in ten subtests. The five factors, 
each represented by verbal and nonverbal subtests, 
are Fluid Reasoning, Knowledge, Quantitative
Reasoning, Visual-Spatial Reasoning, and Working 
Memory. 


8. In the SB5, two routing procedures (Object 
Series/Matrices and Vocabulary) are used to reduce 
the number of items subsequently administered 
without sacrificing measurement precision. The test 
is normed for persons from age 2 to 85 and older. 
The use of item response theory in test development 
resulted in strong reliabilities for IQ, factor, and sub- 
test scores. 


9. Special features of the test include exten- 
sive high-end items and improved low-end items, 
so that the test excels at both extremes of the cognitive
spectrum. The SB5 is also the first intelligence
test to consider religious diversity (Christian,
Jewish, Muslim, Hindu, and Buddhist back- 
grounds) in the evaluation of test fairness. 


10. The Detroit Tests of Learning Aptitude-4 
(DTLA-4) consists of 10 subtests that form the 
basis for computing 16 composites. The DTLA-4 
is a good measure of general intelligence, but the 
conceptual breakdown into 14 ability areas needs 
empirical support. 

11. The Kaufman Brief Intelligence Test (K- 
BIT) is a well-normed screening test of general in- 
tellectual ability that consists of Vocabulary and 
Matrices. Although the K-BIT yields IQs about 5 
points higher than the WAIS-R, the exceedingly 
strong correlation between these two instruments 
(r = .88) is highly supportive of K-BIT validity. 


KEY TERMS AND CONCEPTS 


IQ constancy
verbal comprehension
perceptual organization
freedom from distractibility
processing speed
routing procedure


Topic 6B Group Tests of Intelligence 


Origins and Characteristics of Group Tests 
Multidimensional Aptitude Battery (MAB) 

Shipley Institute of Living Scale (SILS) 

A Multilevel Battery: The Cognitive Abilities Test (CogAT) 
Culture Fair Intelligence Test (CFIT) 

Raven’s Progressive Matrices (RPM) 

Perspective on Culture-Fair Tests 


Summary 
Key Terms and Concepts 


A group intelligence test allows for the quick
and efficient testing of dozens or hundreds 
of examinees at the same time. In this topic we in- 
troduce the reader to a sampling of prominent 
group tests. For better or for worse, the number of 
group tests currently marketed is simply astonish- 
ing—scores of them are available. Several dozen 
entries are reviewed in recent issues of the Mental 
Measurements Yearbook (Mitchell, 1985; Conoley 
& Kramer, 1989, 1992) and the Test Critiques se- 
ries (Keyser & Sweetland, 1984-1988), and new 
instruments are published every year. Comprehen- 
sive coverage of this burgeoning field is simply not 
feasible. Consequently, we focus here on issues 
raised by group tests and then review an eclectic as- 
sortment of these diverse instruments. 


ORIGINS AND CHARACTERISTICS
OF GROUP TESTS

Origins of Group Tests


The first useful group intelligence tests were de- 
veloped early in the twentieth century in the United 
States. Nonetheless, the origins of these instru- 
ments can be traced to the efforts of nineteenth- 
century European psychologists. The modern 
group intelligence test owes a debt especially to the 
completion technique developed in the 1890s by 
Ebbinghaus (1896). His test consisted of several
passages of text with words or parts of words omit- 
ted, as in the following brief example: 

Little Red Riding Hood

____ there was a sweet young ____, beloved
by every ____ who ____ eyes on her. Her
mother gave her a little cap of
silk, which she wore ____ the time. The
____ was known as Red Riding Hood.
One ____ her mother told her, “Your ____ is
ill and weak. ____ take this cake and wine to
her. Do not stray from the ____ and do not
____ to strangers.”

The student’s task was to fill in as many blanks 
as possible (for several selections) in a five-minute 
time limit. The completion test was commonly ad- 
ministered to an entire class by one person. The 
task was highly speeded: Only four times in several 
thousand cases did a student fill in all of the blanks. 
Ebbinghaus used the total number correct as a basis 
for comparing individuals as to their intellectual 
ability (DuBois, 1970). 

A few years later, the practical success of the 
Binet scales inspired psychologists to develop in- 
telligence tests that could be administered simulta- 
neously to large numbers of examinees. We have 
noted in a previous chapter that the need to quickly
test thousands of Army recruits for WWI inspired 
psychologists in the United States, led by Robert 
M. Yerkes, to make rapid advances in psychomet- 
rics and test development. Parallel developments 
occurred in school systems where administrators 
desired an efficient means for testing and placing 
students. However, fill-in-the-blank and open- 
ended questions severely limited the efficiency of 
assessment. Group testing quickly evolved into its 
modern design: the multiple-choice format. 


Differences between Group and Individual Tests 


Group tests differ from individual tests in five ways: 


• Multiple-choice versus open-ended format
• Objective machine scoring versus examiner scoring
• Group versus individualized administration
• Applications in screening versus remedial planning
• Huge versus merely large standardization samples


We discuss each of these points in turn. 

The most obvious difference is that group tests 
generally employ a multiple-choice format. Al- 
though early group tests did use open-ended ques- 
tions, this feature was quickly dropped because of 
the excessive amounts of time required for scoring. 
As a result of the multiple-choice format, group 
tests can be quickly and objectively scored by an 
optical scanning device hooked up to a computer. 
Computer scoring eliminates examiner errors and 
halo effects that may occur in the scoring of indi- 
vidual tests. In addition, psychometricians gain 
nearly instant access to item analyses and test data 
banks, so computer scoring promotes the quick de- 
velopment and revision of group tests. 

Group tests also differ from individual tests in 
the mode of administration. In a group test, the 
examiner plays a minimal role that is restricted 
largely to reading instructions and enforcing time 
limits. There is negligible opportunity for one-on- 
one interaction between the test giver and the test 
taker. For most examinees, this will not matter, 
but for a few (the shy, the confused, the unmotivated)
the absence of examiner rapport can have
disastrous results. 

Traditional intelligence tests excel as aids in the 
diagnosis and remediation of individual learning 
difficulties, whereas group intelligence tests are 
more commonly used for mass screening in the fur- 
therance of institutional decision making. Thus, 
group tests might be used in school systems to 
“flag” children in need of academic remediation or 
enrichment; in industrial settings to identify good 
candidates for specific jobs; or in military settings 
to help cull out mentally impaired recruits. 

Group tests are generally standardized on ultra- 
large samples—hundreds of thousands of subjects 
instead of just the few thousand carefully selected 
cases used with individual tests. Of course, the suit- 
ability of a standardization sample must never be 
taken for granted. Whether using huge standard- 
ization samples for group testing, or smaller stan- 
dardization samples for individual testing, it is still 
important to determine the degree to which the 
sample is representative of the population at large. 


Advantages and Disadvantages 
of Group Testing 


Although the early psychometric pioneers embraced 
group testing wholeheartedly, they recognized fully 
the nature of their Faustian bargain: Psychologists 
had traded the soul of the individual examinee in re- 
turn for the benefits of mass testing. Whipple (1910) 
summed up the advantages of group testing but also 
pointed to the potential perils: 


Most mental tests may be administered either to in- 
dividuals or to groups. Both methods have advan- 
tages and disadvantages. The group method has, of 
course, the particular merit of economy of time; a 
class of 50 or 100 children may take a test in less 
than a fiftieth or a hundredth of the time needed to 
administer the same test individually. Again, in cer- 
tain comparative studies, e.g., of the effects of a 
week’s vacation upon the mental efficiency of 
school children, it becomes imperative that all S’s 
should take the tests at the same time. On the other 
hand, there are almost sure to be some S’s in every 
group that, for one reason or another, fail to follow 
instructions or to execute the test to the best of their
ability. The individual method allows E to detect
these cases, and in general, by the exercise of per- 
sonal supervision, to gain, as noted above, valuable 
information concerning S’s attitude toward the test. 


In sum, group testing poses two interrelated risks: 
(1) some examinees will score far below their true 
ability, owing to motivational problems or diffi- 
culty following directions, and (2) invalid scores 
will not be recognized as such, with undesirable 
consequences for these atypical examinees. There 
is really no simple way to entirely avoid these risks, 
which are part of the trade-off for the efficiency of 
group testing. However, it is possible to minimize 
the potentially negative consequences if examiners 
scrutinize very low scores with skepticism and rec- 
ommend individual testing for these cases. 
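
The safeguard just described, treating very low group-test scores with skepticism and referring those examinees for individual testing, amounts to a simple screening rule. The sketch below is one hypothetical way to express it. The cutoff of two standard deviations below the mean, the mean-100 and SD-15 metric, and the examinee labels are all assumptions made for the illustration; the text does not prescribe a specific threshold.

```python
def flag_for_individual_testing(scores, mean=100.0, sd=15.0, cutoff_z=-2.0):
    """Return the examinees whose group-test score falls at or below the cutoff.

    A flag means "retest individually before drawing conclusions,"
    not "this score is valid."
    """
    return [name for name, score in scores.items()
            if (score - mean) / sd <= cutoff_z]


# Hypothetical group-test results on a mean-100, SD-15 metric.
group_results = {
    "examinee_01": 104,
    "examinee_02": 68,
    "examinee_03": 91,
    "examinee_04": 70,
}

print(flag_for_individual_testing(group_results))
```

A rule of this kind does not reveal whether a low score reflects low ability, poor motivation, or misunderstood directions. It only identifies the cases in which that question needs to be asked.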

We turn now to an analysis of several promi- 
nent group intelligence tests. The reader is re- 
minded that, owing to the sheer number of these 
instruments, our review is necessarily selective. We 
present a balance of older, established instruments 
and newer, promising additions to the field, begin- 
ning with a test that attempts to bridge the gap be- 
tween individual and group tests of intelligence. 


MULTIDIMENSIONAL APTITUDE
BATTERY (MAB)


The Multidimensional Aptitude Battery (MAB; 
Jackson, 1984a) is a recent group intelligence test 
designed to be a paper-and-pencil equivalent of the 
WAIS-R (Krieshok & Harrington, 1985). As the 
reader will recall, the WAIS-R is a highly respected 
instrument (now replaced by the WAIS-III), in its
time the most widely used of the available adult in- 
telligence tests. Kaufman (1983) noted that the 
WAIS-R was “the criterion of adult intelligence, and 
no other instrument even comes close.” However, a 
highly trained professional needs about 1½ hours
just to administer the Wechsler adult test to a single 
person. Because professional time is at a premium, 
a complete Wechsler intelligence assessment—in- 
cluding administration, scoring, and report writ- 
ing—easily can cost hundreds of dollars. Many 
examiners have long suspected that an appropriate 
group test, with the attendant advantages of objective
scoring and computerized narrative report, could
provide an equally valid and much less expensive al- 
ternative to individual testing for most persons. 


Background and Description 


The MAB was designed to produce subtests and 
factors parallel to the WAIS-R, but employing a 
multiple-choice format capable of being computer 
scored. The apparent goal in designing this test was 
to produce an instrument that could be adminis- 
tered to dozens or hundreds of persons by one ex- 
aminer (and perhaps a few proctors) with minimal 
training. In addition, the MAB was designed to 
yield IQ scores with psychometric properties sim- 
ilar to those found on the WAIS-R. Appropriate for 
examinees from age 16 to 74, the MAB yields 10 
subtest scores, as well as Verbal, Performance, and 
Full Scale IQs. 

Although it consists of original test items, the 
MAB is mainly a sophisticated subtest-by-subtest 
clone of the WAIS-R. The 10 MAB subtests are 
listed as follows: 


Verbal            Performance

Information       Digit Symbol
Comprehension     Picture Completion
Arithmetic        Spatial
Similarities      Picture Arrangement
Vocabulary        Object Assembly


The reader will notice that Digit Span from the 
WAIS-R is not included on the MAB. The reason 
for this omission is largely practical: There would 
be no simple way to present a Digit-Span-like sub- 
test in paper-and-pencil format. In any case, the 
omission is not serious. Digit Span has the lowest 
correlation with overall WAIS-R IQ, and it is widely 
recognized that this subtest makes a minimal con- 
tribution to the measurement of general intelligence. 

The only significant deviation from the WAIS-R
is the replacement of Block Design with a Spatial
subtest on the MAB. In the Spatial subtest,
examinees must mentally perform spatial rotations
of figures and select one of five possible rotations
presented as their answer (Figure 6.8). Only
mental rotations are involved (although “flipped-over”
versions of the original stimulus are included
as distractor items). The advanced items are very
complex and demanding.


Picture Completion: Choose the letter that begins the word describing the missing part of
the picture.
The answer is Light, so A should be marked.

Spatial: Choose one figure to the right of the vertical line which is the same as the figure on
the left. One figure can be turned to look like the figure on the left; the others would have to be
flipped over.
The correct answer is A, so A should be marked. The others (B, C, D, and E) would have to be
flipped over.

Object Assembly: Choose the order, from left to right, in which these parts should be
placed to form the object.
The correct answer is C-132, so C should be marked. Only this order would create the object
teacup.

FIGURE 6.8 Demonstration Items from Three Performance Tests of the Multidimensional
Aptitude Battery (MAB)

Source: Reprinted with permission from Jackson, D. N. (1984a). Manual for the Multidimensional Aptitude Battery.
Port Huron, MI: Sigma Assessment Systems, Inc. (800) 265-1285.


The items within each of the 10 MAB subtests 
are arranged in order of increasing difficulty, be- 
ginning with questions and problems that most 
adolescents and adults find quite simple and pro- 
ceeding upward to items that are so difficult that 
very few persons get them correct. There is no 
penalty for guessing and examinees are encouraged 
to respond to every item within the time limit. Un- 
like the WAIS-R in which the verbal subtests are 
untimed power measures, every MAB subtest in- 
corporates elements of both power and speed: Ex- 
aminees are allowed only seven minutes to work on 
each subtest. Including instructions, the Verbal and 
Performance portions of the MAB each take about 
50 minutes to administer. 


Technical Features 


The first release of the MAB (Jackson, 1984a) was 
not standardized in the traditional manner in which 
scores are tied to the performance of large, repre- 
sentative samples stratified on such variables as sex, 
race, urban/rural residence, parental occupation, ge- 
ographic region, and the like. Instead, the test de- 
velopers pursued a strategy of calibrating MAB 
scores to the WAIS-R as an anchor test. To derive 
the linear calibrating formula, the WAIS-R and 
MAB were both administered, in counterbalanced 
fashion, to a sample consisting of university stu- 
dents (n = 18), senior high school students (n = 74), 
hospitalized psychiatric patients (n = 58), and pro- 
bationers (n = 10). The subjects, 117 males and 43 
females, ranged in age from 16 to 35. The correla- 
tion coefficients between MAB and WAIS-R IQs 
were found to be .82, .65, and .91, for Verbal, Per- 
formance, and Full Scale IQ, respectively. The norm 
tables reported in the manual actually reflect a sim- 
ple linear transformation from MAB raw scores to 
WAIS-R IQs for this initial sample of 160 subjects. 
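
The anchor-test strategy described above can be sketched in a few lines. The text says only that a linear calibrating formula was derived, so the method shown here (mean-sigma linear equating, which matches the mean and standard deviation of the converted scores to those of the anchor) is our assumption rather than the documented procedure. The raw scores and anchor IQs are invented, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_linear_calibration(raw_scores, anchor_iqs):
    """Return (slope, intercept) so that converted raw scores share the
    mean and standard deviation of the anchor IQs in this sample."""
    slope = pstdev(anchor_iqs) / pstdev(raw_scores)
    intercept = mean(anchor_iqs) - slope * mean(raw_scores)
    return slope, intercept


def to_iq(raw_score, slope, intercept):
    """Apply the linear calibration to a single raw score."""
    return intercept + slope * raw_score


# Hypothetical calibration sample: group-test raw scores and anchor-test IQs.
raw_sample = [40, 50, 60, 70, 80]
anchor_sample = [85, 95, 100, 105, 115]

slope, intercept = fit_linear_calibration(raw_sample, anchor_sample)
converted = [to_iq(raw, slope, intercept) for raw in raw_sample]
print([round(iq, 1) for iq in converted])
```

Whatever linear method is used, the resulting norms can be no more representative than the calibration sample itself. That is the main limitation of tying scores to an anchor test instead of to a stratified standardization sample.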

The manual reports several studies of internal 
consistency and test-retest reliability; the results are 
generally quite impressive. For example, in one 
study of over 500 adolescents ranging in age from 
16 to 20, the internal consistency reliability of Verbal,
Performance, and Full Scale IQs was in the
high .90s. In a test-retest study of 52 young psy- 
chiatric patients, the individual subtests showed re- 
liabilities that ranged from .83 to .97 (median of 
.90) for the Verbal scale and from .87 to .94 (me- 
dian of .91) for the Performance scale. These 
results compare quite favorably with the psy- 
chometric standards reported for the WAIS-R 
(Wechsler, 1981). 

Factor analyses of the MAB offer strong sup- 
port for the construct validity of this instrument 
(Lee, Wallbrown, & Blaha, 1990; Wallbrown, 
Carmin, & Barnett, 1988). In a factor analysis of 
scores for 3,121 male and female high school stu- 
dents, the manual reports a general factor with 
moderate to high loadings for all subtests (ranging 
from .53 to .82). In a separate factor analysis of data 
for the standardization subjects, Lee, Wallbrown, 
and Blaha (1990) found two orthogonal factors 
after the first general factor. These two rotated fac- 
tors are clearly identifiable as Verbal and Perfor- 
mance factors. In addition, other researchers have 
noted the extremely strong congruence between 
factor analyses of the WAIS-R (with Digit Span removed)
and the MAB. In a large sample of inmates,
Ahrens, Evans, and Barnett (1990) observed valid- 
ity-confirming changes in MAB scores in relation 
to education level. Thus, there is good justification 
for the use of separate Verbal and Performance 
scales on this test. 

In general, the validity of the MAB rests upon 
its very strong physical and empirical resemblance 
to its parent test, the WAIS-R. Correlational data be- 
tween MAB and WAIS-R scores are crucial in this 
regard. For 145 persons administered the MAB and 
WAIS-R in counterbalanced fashion, correlations
between subtests ranged from .44 (Spatial/Block 
Design) to .89 (Arithmetic and Vocabulary), with a 
median of .78. WAIS-R and MAB IQ correlations 
were very healthy, namely, .92 for Verbal IQ, .79 for 
Performance IQ, and .91 for Full Scale IQ (Jackson, 
1984a). With only a few exceptions, correlations be- 
tween MAB and WAIS-R scores exceed those be- 
tween the WAIS and the WAIS-R. Carless (2000) 
reported a similar, strong overlap between MAB
scores and WAIS-R scores in a study of 85 adults
for the Verbal, Performance, and Full Scale IQ 
scores. However, she found that 4 of the 10 MAB 
subtests did not correlate with the WAIS-R sub- 
scales they were designed to represent, suggesting 
caution in using this instrument to obtain detailed 
information about specific abilities. 


Comment on the MAB 


Jackson (1984a) exercised great care in the de- 
velopment of the MAB, continually refining it 
over a period of some 10 years prior to release. Dur- 
ing this time, items were selected, revised, and 
deleted, according to stringent psychometric crite- 
ria regarding difficulty level, discriminatory power, 
and efficiency of distractor alternatives. Not 
surprisingly, the resulting instrument is a technical 
tour de force of psychometric excellence. Reliabil- 
ity indices are strong, factor analyses confirm the 
verbal/performance dichotomy, and subtest scores 
and overall IQs correlate exceptionally well with 
corresponding measures from the WAIS-R. 

Nonetheless, several reviewers have raised cau- 
tions and concerns about the MAB that deserve 
mention. Krieshok and Harrington (1985) note that 
the manual does not provide readability estimates 
for the instructions or for the items themselves. The 
manual does state vaguely that the MAB “presup- 
poses language skills necessary to read and under- 
stand written directions and to comprehend spoken 
directions.” However, it does not recommend a min- 
imum reading level for valid administration. This 
may lead the examiner to assume that anyone who 
meets the minimum age level of 16 can take the 
MAB, a patently unsafe presumption. In fact, 
Krieshok and Harrington (1985) subjected the 
MAB to a computerized readability analysis and 
concluded that some verbal items required a tenth- 
grade reading level. Because of the relatively high 
reading level required by parts of this test, it seems 
likely that an otherwise very bright student with a 
reading disability might score artificially low on the 
MAB. 

The MAB shows great promise in research, ca-
reer counseling, and personnel selection. In addi-
tion, this test could function as a screening instru-
ment in clinical settings, so long as the examiner
views low scores as a basis for follow-up testing 
with the WAIS-R. Examiners must keep in mind 
that the MAB is a group test and therefore carries 
with it the potential for misuse in individual cases. 
The MAB should not be used in isolation for diag- 
nostic decisions or for placement into programs 
such as classes for intellectually gifted persons. 






SHIPLEY INSTITUTE OF
LIVING SCALE (SILS)


The Shipley Institute of Living Scale (SILS) is also 
known as the Shipley-Hartford because of its in- 
ception in Hartford, Connecticut, decades ago 
(Shipley, 1940, 1983). The SILS was originally 
proposed as an index of intellectual deteriora- 
tion, in an attempt to gauge the effects of demen- 
tia, brain damage, and other organic conditions. 
However, the test has been used primarily as a short 
screening test of intelligence, particularly within 
the mental health system of the Veterans Adminis- 
tration. 





Background and Description 


The SILS consists of two subtests, vocabulary and 
abstractions. The original intention of the test was 
to detect organic intellectual deterioration by con- 
trasting performance on the vocabulary and ab- 
stractions sections. Vocabulary was thought to be 
relatively unaffected by organic deterioration, 
whereas it was believed that abstraction ability 
would show significant decline. A large discrepancy 
favoring vocabulary over abstractions therefore 
would appear to signify the presence of organic im- 
pairment. However, numerous studies and reviews 
concluded that the SILS performs poorly as an 
index of brain damage (e.g., Yates, 1954; Johnson,
1987), and the instrument is seldom used for this 
purpose. 

The SILS consists of 40 multiple-choice vo- 
cabulary items and 20 abstract-thinking items. 
Each item is scored right or wrong. The abstract
items count double, so the maximum score on each
half of the test is 40 points. A composite score is 
also reported. The test is self-administered with a 
10-minute limit for each of the two sections. Some 
users favor an untimed use of the test, and separate 
norms have been developed for this approach 
(Heinemann, Harper, Friedman, & Whitney, 1985). 
Few persons require more than 10 minutes per 
section; most examiners consider the SILS to be 
entirely a power measure. A microcomputer ver- 
sion of the test is also available. The computer ad- 
ministers and scores the test and produces a 
narrative report and graphic depiction of scores. 
The examinee’s task on the vocabulary section 
is to select the synonym of a word from four alter- 
natives. The 40 items resemble the following: 


• SHIP     house   tree     fork    boat
• INANE    fat     timely   silly   dry


The vocabulary score is the number correct plus one 
point for every four items omitted. Adding points 
for items omitted provides a correction for the re- 
fusal to guess. As a result of this correction factor, 
the minimum score is about 10 out of the 40 points. 
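
The scoring rules just described can be sketched in a few lines of Python. This is a minimal illustration rather than the publisher's scoring procedure. The function names are hypothetical, the text does not say how fractional credit for omitted items is rounded (so none is applied here), and treating the composite as the simple sum of the two halves is an assumption.

```python
def vocabulary_score(correct: int, omitted: int) -> float:
    """Number correct plus one point for every four items omitted."""
    if correct < 0 or omitted < 0 or correct + omitted > 40:
        raise ValueError("the vocabulary section has 40 items")
    # Assumption: fractional credit is left unrounded.
    return correct + omitted / 4


def abstraction_score(correct: int) -> int:
    """Each of the 20 abstraction items counts double (maximum 40)."""
    if not 0 <= correct <= 20:
        raise ValueError("the abstractions section has 20 items")
    return 2 * correct


def total_score(vocab_correct: int, vocab_omitted: int,
                abstraction_correct: int) -> float:
    """Assumed composite: the sum of the two section scores (maximum 80)."""
    return (vocabulary_score(vocab_correct, vocab_omitted)
            + abstraction_score(abstraction_correct))
```

Under this rule an examinee who omits all 40 vocabulary items receives 10 points from the omission credit alone, which is consistent with the minimum of about 10 noted above.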

The intention of the abstractions items is that 
they should require the examinee to infer a princi- 
ple common to a given series of components and 
then to demonstrate this understanding of the prin- 
ciple by finishing the series. Each item is a series 
of letters or numbers followed by blanks to indicate 
the number of characters in the answer. The 20 
items resemble the following: 


• A B D G K _
• bog hob mars tram 268 _ _ _
• 135 341 52 12_


The examinee must complete each series and place 
the appropriate answer in the blanks. (Answers to 
the preceding items are P, 962, and 3.) Of course,
to derive the correct answer the examinee must 
infer the rule that governs the progression of stim- 
uli in each item and then use that rule to determine 
the continuation. (In item 1 the distance between 
letters increases arithmetically; in item 2 the pairs 
are mirror images of each other, except for the last and
first letters, which increment by one: g to h, s to t;
in item 3, each group of numbers sums to one less
than the previous group: 9, 8, 7, ...).
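
As a check on the three rules, the following Python sketch applies each one to its sample item and recovers the stated answers. The function names are hypothetical, and the third item is read here as the digit groups 135, 341, 52, and 12_, which is an assumption about the printed layout that fits the sums of 9, 8, and 7 given above.

```python
def next_letter(series: str) -> str:
    """Item 1: the gap between successive letters grows by one each step."""
    last_gap = ord(series[-1]) - ord(series[-2])
    return chr(ord(series[-1]) + last_gap + 1)


def mirror_with_increment(token: str) -> str:
    """Item 2: reverse the token, then advance its new first character by one."""
    reversed_token = token[::-1]
    return chr(ord(reversed_token[0]) + 1) + reversed_token[1:]


def missing_digit(groups: list[str], partial: str) -> int:
    """Item 3: each group's digits sum to one less than the previous group's."""
    previous_sum = sum(int(d) for d in groups[-1])
    target = previous_sum - 1
    return target - sum(int(d) for d in partial)


print(next_letter("ABDGK"))                      # P
print(mirror_with_increment("bog"))              # hob
print(mirror_with_increment("268"))              # 962
print(missing_digit(["135", "341", "52"], "12")) # 3
```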


Technical Features 


Zachary (1986) has published norms for the SILS 
based on 290 mixed psychiatric patients who had 
also taken the WAIS. The sample contains approx- 
imately equal numbers of men and women. This 
norm sample is young: Most are between the ages 
of 16 and 54, with a median age of 30. Based on 
this sample, the manual contains tables of age-cor- 
rected T scores (mean of 50, SD of 10) for vocab- 
ulary and abstractions. Against the better advice of 
numerous prior researchers, the author of the SILS 
manual also introduced the Abstraction Quotient 
(AQ), a new impairment index based on the differ- 
ence between Vocabulary and Abstraction scores. 
The AQ is obtained by comparing the predicted ab- 
stractions score to the obtained abstractions score. 
The predicted score is derived from a regression 
equation that uses vocabulary score, age, and edu- 
cational level. The AQ is an improvement over 
previous impairment indices in that naturally oc- 
curring age decrements are accounted for in its 
computation. Persons with schizophrenia and other 
individuals with diminished intellectual efficiency 
tend to obtain low AQs. Nonetheless, there are non- 
pathological causes of a low AQ (e.g., a distaste for 
abstract concepts), and the utility of this index is 
therefore open to question. Mason, Lemmon, 
Wayne, and Schmidt (1991) have attempted to re- 
vive the AQ approach to the use of the Shipley by 
publishing regression equations for computing Ab- 
straction Quotients that use age, sex, and social 
class as moderating variables. However, they do not 
provide any evidence for the validity of the AQ as 
an index of brain impairment. After a comprehen- 
sive review of the literature, Phay (1990) deter- 
mined that the SILS is inadequate as a test of 
cognitive deterioration even when corrections are 
made for age and education. 

The reliability of the SILS is marginal. Typical 
internal consistency measures (odd-even correla- 
tions) are .87 (vocabulary), .89 (abstractions), and 
.92 (total score). However, as noted by the Stan-
dards for Educational and Psychological Testing
(AERA, APA, & NCME, 1985), split-half coeffi- 
cients of the odd-even variety produce inflated re- 
liability estimates for highly speeded tests. To the 
extent that scores on the SILS are based upon speed 
instead of power, these reliabilities will be artifi- 
cially high. Test-retest reliabilities are probably 
more appropriate for the SILS. These reliabilities 
vary considerably in the literature, but approach .80 
for the total score in larger and more heterogeneous 
samples (Johnson, 1987). 

Insofar as the SILS is used primarily as a 
screening test of intelligence, the validity of this in- 
strument is strongly linked to its ability to predict 
Full Scale IQ from individual tests such as the 
WAIS or WAIS-R. As reviewed by Johnson (1987), 
literally dozens of correlational studies have inves- 
tigated the accuracy of the SILS as a predictor of 
Wechsler IQ (e.g., Zachary, Crumpton, & Spiegel, 
1985). Correlations between the SILS and the 
Wechsler-Bellevue or WAIS range from .65 to .90 
with a median of .76 (Johnson, 1987). Based on 
these studies, Johnson (1987) reports that the 95 
percent confidence interval for SILS-estimated IQ 
is about ±11 IQ points. For example, a Shipley total
score of 60 for a 40-year-old man converts to an es- 
timated WAIS-R IQ of 102; in 95 percent of such 
cases the examinee’s actual WAIS-R IQ will fall in 
the range of 91 to 113 (Zachary, 1986). 


Comment on the SILS 


The venerable SILS is a reasonably good measure of
general intelligence that has found widespread use 
in research. In addition, the instrument continues to 
be quite popular as a screening test for general in- 
telligence and possible intellectual inefficiency 
(Bowers & Pantle, 1998). While the SILS is useful 
for very broadband intellectual screening, it should 
not be used to make more fine-grained discrimina- 
tions. Responsible clinicians will use an individual 
intelligence test (e.g., K-BIT, WAIS-III) when a
more precise individual assessment is needed. 
Even though it is a passable screening test, the 
SILS possesses a number of significant limitations: 

1. The SILS is inappropriate for low-IQ persons or
those with significant language disabilities. 

2. The test has a low ceiling, especially on the ab- 
stractions section, and does not spread high-IQ 
examinees. 

3. The SILS has a band of error approaching 11 IQ 
points, which may be excessive for many
applications. 


A MULTILEVEL BATTERY: THE 
COGNITIVE ABILITIES TEST (CogAT) 


One important function of psychological testing is 
to assess students’ abilities that are prerequisite to 
traditional classroom-based learning. In designing 
tests for this purpose, the psychometrician must 
contend with the obvious and nettlesome problem 
that school-aged children differ hugely in their in- 
tellectual abilities. For example, a test appropriate 
for a sixth grader will be much too easy for a tenth 
grader, yet impossibly difficult for a third grader. 

The answer to this dilemma is a multilevel bat- 
tery, a series of overlapping tests. In a multilevel 
battery, each group test is designed for a specific 
age or grade level, but adjacent tests possess some 
common content. Because of the overlapping con- 
tent with adjacent age or grade levels, each test pos- 
sesses a suitably low floor and high ceiling for 
proper assessment of students at both extremes of 
ability. In addition, multilevel batteries usually pro- 
vide a much desired continuity in the abilities mea- 
sured. Furthermore, multilevel batteries generally 
employ highly comparable normative samples at 
the successive levels. For all of these reasons, mul- 
tilevel batteries are considered ideal for gauging 
student readiness for school learning. Virtually 
every school system in the United States uses at 
least one nationally normed multilevel battery. 

The Cognitive Abilities Test (CogAT) is one of 
the best school-based test batteries in current use 
(Lohman & Hagen, 2001). A recent revision of the 
test is the CogAT Multilevel Edition, Form 6, re- 
leased in 2001. We discuss this instrument in some 
detail and then provide a brief summary of com- 
peting tests. 


Background and Description 


The CogAT evolved from the Lorge-Thorndike In- 
telligence Tests, one of the first group tests of 
intelligence intended for widespread use within 
school systems. The CogAT is primarily a measure
of scholastic ability, but also incorporates a non- 
verbal reasoning battery with items that bear no di- 
rect relation to formal school instruction. The two 
primary batteries, suitable for students in kinder- 
garten through third grade, are briefly discussed at 
the end of this section. Here we review the multi-
level edition intended for students in third through
twelfth grade. 

The nine subtests of the multilevel CogAT are 
grouped into three batteries as follows: 


Verbal Battery           Quantitative Battery      Nonverbal Battery

Verbal Classification    Quantitative Relations    Figure Classification
Sentence Completion      Number Series             Figure Analogies
Verbal Analogies         Equation Building         Figure Analysis


For each CogAT subtest, items are ordered by dif- 
ficulty level in a single test booklet. However, entry 
and exit points differ for each of eight overlapping 
levels (A through H). In this manner, grade-appro- 
priate items are provided for all examinees. All sub- 
tests except one use a multiple-choice format. The 
exception is Figure Analysis, in which the exami- 
nee responds yes or no to a series of alternatives. 
The subtests are strictly timed, with limits 
that vary from eight to twelve minutes. Each of 
the three batteries can be administered in less than 
an hour. However, the manual recommends three 
successive testing days for younger children. For 
older children, two batteries should be administered 
the first day, with a single testing period the next. 
Many subtests of the CogAT bear a striking 
resemblance to portions of the Stanford-Binet: 
Fourth Edition. For example, both tests include 
paper-folding items. Common parentage is the ex- 
planation: Both tests were developed by Elizabeth 
Hagen; both tests were published by Riverside Pub-
lishing Company. We see once again the hybrid char-
acter of modern intelligence tests, in which new tests
incorporate the best features of their predecessors. 

Raw scores for each battery can be transformed 
into an age-based normalized standard score with 
mean of 100 and standard deviation of 15. In addi- 
tion, percentile ranks and stanines for age groups 
and grade level are also available. Interpolation was 
used to determine fall, winter, and spring grade- 
level norms. 
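
To show how the reported score types relate to one another, here is a minimal Python sketch that converts a standard score (mean 100, SD 15) to an approximate percentile rank and stanine. It assumes a perfectly normal score distribution and uses the conventional stanine definition (mean 5, SD 2, bands half a standard deviation wide). The function names are hypothetical, and the published CogAT conversion tables are derived from the norm sample rather than from these formulas.

```python
import math


def percentile_rank(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentage of the normal distribution falling at or below the score."""
    z = (score - mean) / sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stanine(score: float, mean: float = 100.0, sd: float = 15.0) -> int:
    """Stanine (1 to 9): half-SD bands centered on the mean, clipped at the ends."""
    z = (score - mean) / sd
    return max(1, min(9, math.floor(2 * z + 5.5)))


for s in (70, 100, 115, 130):
    print(s, round(percentile_rank(s), 1), stanine(s))
```

A standard score of 115, one standard deviation above the mean, falls at roughly the 84th percentile and in stanine 7 under these assumptions.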


Technical Features


The CogAT was co-normed (standardized con- 
currently) with two achievement tests, the Iowa 
Tests of Basic Skills and the Iowa Tests of Educa- 
tional Development. Concurrent standardization 
with achievement measures is a common and de- 
sirable practice in the norming of multilevel intel- 
ligence tests. The particular virtue of joint norming 
is that the expected correspondence between intel- 
ligence and achievement scores is determined with 
great precision. As a consequence, examiners can 
more accurately identify underachieving students 
in need of remediation or further assessment for po- 
tential learning disability. 

The reliability of the CogAT is exceptionally 
good. In previous editions, the Kuder-Richardson- 
20 reliability estimates for the multilevel batteries 
averaged .94 (Verbal), .92 (Quantitative), and .93 
(Nonverbal) across all grade levels. The six-month 
test-retest reliabilities for alternate forms ranged 
from .85 to .93 (Verbal), .78 to .88 (Quantitative), 
and .81 to .89 (Nonverbal). 

The manual provides a wealth of information 
on content, criterion-related, and construct validity 
of the CogAT; we summarize only the most perti- 
nent points here. Correlations between the CogAT 
and achievement batteries are substantial. For ex- 
ample, the CogAT verbal battery correlates in the 
.70s to .80s with achievement subtests from the 
Iowa Tests of Basic Skills. 

The CogAT batteries predict school grades 
reasonably well. Correlations range from the .30s 
to the .60s, depending upon grade level, sex, and
ethnic group. There does not appear to be a clear 
trend as to which battery is best at predicting grade
point average. Correlations between the CogAT
and individual intelligence tests are also substan-
tial, typically ranging from .65 to .75. These find-
ings speak well for the construct validity of the
CogAT insofar as the Stanford-Binet is widely 
recognized as an excellent measure of individual 
intelligence. 


Comment on the CogAT 


The CogAT multilevel edition is a highly reliable 
group test of intelligence that is carefully normed 
for students in grades three through twelve. The 
concurrent standardization with two achievement 
tests is a welcome and practical feature. In support 
of CogAT validity, correlations with grades,
achievement, and other measures of intelligence 
are quite robust. Recently, a German version of the 
CogAT has been produced (Perleth, Hofmann, 
Schauer, & Wernberger, 1994). 

Ansorge (1985) has questioned whether all 
three batteries are really necessary. He points out 
that correlations among the Verbal, Quantitative, 
and Nonverbal batteries are substantial. The me- 
dian values across all grades are as follows: 


Verbal and Quantitative        .78
Nonverbal and Quantitative     .78
Verbal and Nonverbal           .72


Since the Quantitative battery offers little unique- 
ness, from a purely psychometric point of view 
there is no justification for including it. Nonethe- 
less, the test authors recommend use of all batter- 
ies in hopes that differences in performance will 
assist teachers in remedial planning. However, the 
test authors do not make a strong case for doing 
this. 

A recent study by Stone (1994) provides a no- 
table justification for using the CogAT as a basis for 
student evaluation. He found that CogAT scores for 
403 third graders provided an unbiased prediction 
of student achievement that was more accurate than 
teacher ratings. In particular, teacher ratings 
showed bias against Caucasian and Asian American
students by underpredicting their achievement
scores. 






CULTURE FAIR INTELLIGENCE
TEST (CFIT)


The Culture Fair Intelligence Test (Cattell, 1940; 
IPAT, 1973) is a nonverbal measure of fluid intelli- 
gence first conceived in the 1920s by the prominent 
measurement psychologist Raymond B. Cattell. 
The goal of the CFIT is to measure fluid intelli- 
gence—analytical and reasoning ability in abstract 
and novel situations—in a manner that is as “free” 
of cultural bias as possible. This test was originally 
called the Culture Free Intelligence Test. The name 
was changed when it became evident that cultural 
influences cannot be completely extirpated from 
tests of intelligence. 


Background and Description 


The CFIT has undergone several revisions, emerg- 
ing in its current form in 1961. The test consists of 
three versions: Scale 1 is for use with mentally de- 
fective adults and children ages four to eight; Scale 
2 is for adults in the average range of intelligence 
and children ages eight to thirteen; Scale 3 is for 
high-ability adults and for high school and college 
students. Scale 1 involves considerable interaction 
between tester and examinee—four of the subtests 
must be administered individually. Thus, in some 
respects Scale 1 is more of an individual intelli- 
gence test than a group test. We discuss only Scales 
2 and 3 here, because they are truly group tests of 
intelligence. These two tests differ mainly in diffi- 
culty level. 

Two equivalent forms, called Form A and Form 
B, are available for each scale. The test developers 
recommend administering both forms to each sub- 
ject to obtain what is called the full test. Each form 
by itself is referred to as a short test. In spite of the 
recommendation to use both forms as a combined 
test, it is very common for CFIT users to rely upon 
a single, brief form for purposes of screening. 

Each form consists of four subtests: Series, Clas-
sification, Matrices, and Conditions. Sample items
are shown in Figure 6.9. Of course, each subtest is
preceded by several practice items. The entire test is 
neatly packaged in an eight-page booklet. 

The CFIT is a highly speeded test. Each form 
of Scales 2 and 3 takes about 30 minutes to admin- 
ister, but only 12.5 minutes is devoted to actual test 
taking. Results can therefore be misleading for per-
sons who place no premium on speed of perfor-
mance in problem solving. Fortunately, Scale 2 can
be used as an untimed power test. However, the 
norms for this manner of administration are limited 
(IPAT, 1973). 





[Sample item panels, in order: Mazes; Classification (“Pick out the two odd items in each row of figures.”); Conditions (“Pick the item on the right that fulfills the same conditions.”); Series (“Choose one figure from the six on the right that logically continues the series of three figures at the left.”)]





FIGURE 6.9 Sample Items from the Culture Fair Intelligence Test 
Source: Copyright © by the Institute for Personality and Ability Testing, Inc. Reproduced by permission. 


Technical Features


Standardization samples for Scales 2 and 3 were re- 
spectably large, but not described in sufficient de- 
tail to determine the extent to which they mirror the 
general population. The standardization samples 
were characterized as follows: 


The standardization group for Scale 2 consists
of 4,328 males and females sampled from varied
regions of the United States and Britain. Scale 3 
norms are based on 3,140 cases, consisting of 
American high school students equally divided 
among freshmen, sophomores, juniors, and seniors, 
and young adults in a stratified job sample. (IPAT, 
1973) 


Raw scores are converted to normalized standard 
score IQs with mean of 100 and standard deviation 
of 16. 
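
Because the CFIT reports IQs on a scale with a standard deviation of 16, whereas tests such as the CogAT use a standard deviation of 15, the same relative standing yields slightly different numbers on the two scales. The Python sketch below shows the linear rescaling that equates them. The function name is hypothetical, and the conversion assumes that both scales are normalized with a mean of 100; it does not adjust for any differences between the norm groups.

```python
def rescale_iq(score: float, from_sd: float = 16.0, to_sd: float = 15.0,
               mean: float = 100.0) -> float:
    """Express a standard score on a scale with a different standard deviation."""
    z = (score - mean) / from_sd   # distance from the mean in SD units
    return mean + z * to_sd


print(rescale_iq(116))  # 115.0: one SD above the mean on either scale
print(rescale_iq(84))   # 85.0: one SD below the mean on either scale
```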

Test-retest, alternate-forms, and internal con- 
sistency reliabilities are generally in the .70s for in- 
dividual forms of Scales 2 and 3. The reliabilities 
of the full test are higher, generally in the mid-.80s. 
These results are based on dozens of studies with 
thousands of subjects and indicate a respectable de- 
gree of reliability for such a short instrument (IPAT, 
1973). 
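
The step from single-form reliabilities in the .70s to full-test reliabilities in the mid-.80s is about what the Spearman-Brown formula predicts when a test is doubled in length. That formula is a standard psychometric result brought in here for illustration; the source does not say that the reported full-test values were obtained this way. The function name is hypothetical, and the formula assumes that the two forms are parallel.

```python
def spearman_brown(reliability: float, length_factor: float = 2.0) -> float:
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Single forms with reliabilities across the .70s, combined into a full test.
for r in (0.70, 0.75, 0.79):
    print(r, round(spearman_brown(r), 3))
```

A single-form reliability of .75, for example, corresponds to a predicted full-test reliability of about .86.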

The validity of the CFIT as a measure of gen- 
eral intelligence is established beyond any reason- 
able skepticism. CFIT scores correlate in the 
mid-.80s with the general factor of intelligence and 
show consistently robust relationships—largely in 
the .70s and .80s—with other mainstream measures 
of intelligence (WAIS, WISC, Raven Progressive 
Matrices, Stanford-Binet, Otis, and General Apti- 
tude Test Battery; see IPAT, 1973, p. 11). There is 
no doubt that the CFIT is a well-designed, useful, 
and valid test of intelligence. 

But is the CFIT a culture-fair test, as its title 
proclaims? One professed goal of this instrument 
was to “minimize irrelevant influences of cultural 
learning and social climate” and thereby produce a 
“cleaner separation of natural ability from specific 
learning” (IPAT, 1973). Unfortunately, the available 
evidence indicates that the CFIT is no more suc- 
cessful than traditional measures in the pursuit of a 
culturally fair method for measuring intelligence
(Koch, 1984). For example, Willard (1968) found 
that 83 culturally disadvantaged African American 
children scored about the same on the Stanford- 
Binet (M = 68.1) as on the CFIT (M = 70.0). More- 
over, 14 of the children hit the CFIT “floor” and 
received the lowest possible CFIT IQ score of 57, 
whereas Stanford-Binet IQ scores were dispersed
in a pattern more like a bell-shaped curve. 


Comment on the CFIT 


The CFIT is an excellent brief, nonverbal measure 
of general intelligence. Even when Form A and 
Form B are both used to obtain what is referred to 
as the full test, the CFIT can be administered to 
large groups in less than an hour. An important cau- 
tion to test users is that the laudable goal of pro- 
ducing a culture-fair test has not been accomplished 
by the CFIT. Moreover, the goal itself may be 
chimerical: 


Cultures differ with respect to the importance they 
place on competition with peers in performing 
tasks or solving problems, on speed or quality of 
performance, and on a variety of other test-related 
behaviors. Some cultures emphasize concrete 
rather than abstract problem solving, often to the 
extent that a problem has no meaning except in a 
concrete setting. The very notion of taking some 
artificially contrived test is nonsensical in such sit-
uations. (Koch, 1984)


It is doubtful that a truly culture-fair test is even 
possible. In future editions, the CFIT developers 
would be well advised to rename their test so that 
unsophisticated users do not invest this instrument 
with imaginary properties. 

Even though the CFIT is a worthy test, it is 
badly in need of revision and renorming. The test is 
rather old-fashioned in appearance. Some of the test 
item drawings are so small that only persons with 
perfect vision can infer the figural relations depicted 
in the item components. Previous standardization 
samples have been poorly specified and would ap-
pear to be convenience samples rather than carefully
selected stratified representations of the population
at large.


RAVEN’S PROGRESSIVE
MATRICES (RPM)


First introduced in 1938, Raven’s Progressive Ma- 
trices (RPM) is a nonverbal test of inductive reason- 
ing based on figural stimuli (Raven, Court, & 
Raven, 1986, 1992). This test has been very popular 
in basic research and is also used in some institu- 
tional settings for purposes of intellectual screening. 











Background and Description 


RPM was originally designed as a measure of 
Spearman’s g factor (Raven, 1938). For this reason, 
Raven chose a special format for the test that pre- 
sumably required the exercise of g. The reader is 
reminded that Spearman defined g as the “eduction 
of correlates.” The term eduction refers to the
process of figuring out relationships based on the 
perceived fundamental similarities between stim- 
uli. In particular, to correctly answer items on the 
RPM, examinees must identify a recurring pattern 
or relationship between figural stimuli organized in 
a 3 × 3 matrix. The items are arranged in order of
increasing difficulty, hence the reference to pro- 
gressive matrices. 

Raven’s test is actually a series of three dif- 
ferent instruments. Much of the confusion about 
validity, factorial structure, and the like stems 
from the unexamined assumption that all three 
forms should produce equivalent findings. The 
reader is encouraged to abandon this unwarranted 
hypothesis. Even though the three forms of the 
RPM resemble one another, there may be subtle 
differences in the problem-solving strategies re- 
quired by each. 

The Coloured Progressive Matrices is a 36-item 
test designed for children from 5 to 11 years of age. 
Raven incorporated colors into this version of the 
test to help hold the attention of the young children. 
The Standard Progressive Matrices is normed for 
examinees from 6 years and up, although most of 
the items are so difficult that the test is best suited
for adults. This test consists of 60 items grouped 
into 5 sets of 12 progressions. The Advanced Pro- 
gressive Matrices is similar to the Standard version, 
but has a higher ceiling. The Advanced version con- 
sists of 12 problems in Set I and 36 problems in Set 
II. This form is especially suitable for persons of 
superior intellect. 


Technical Features 


Large sample U.S. norms for the Coloured and 
Standard Progressive Matrices are reported in 
Raven and Summers (1986). Separate norms for 
Mexican American and African American children 
are included. Although there was no attempt to use 
a stratified random-sampling procedure, the selec-
tion of school districts was so widely varied that the
American norms for children appear to be reason- 
ably sound. Sattler (1988) summarizes the relevant 
norms for all versions of the RPM. Recently, 
Raven, Court, and Raven (1992) produced new 
norms for the Standard Progressive Matrices, but 
Gudjonsson (1995) has raised a concern that these 
data are compromised because the testing was not 
monitored. 

For the Coloured Progressive Matrices, split- 
half reliabilities in the range of .65 to .94 are re- 
ported, with younger children producing lower 
values (Raven, Court, & Raven, 1986). For the 
Standard Progressive Matrices, a typical split-half 
reliability is .86, although lower values are found 
with younger subjects (Raven, Court, & Raven, 
1983). Test-retest reliabilities for all three forms 
vary considerably from one sample to the next 
(Burke, 1958; Raven, 1965; Raven et al., 1986). For 
normal adults in their late teens or older, reliability 
coefficients of .80 to .93 are typical. However, for 
preteen children, reliability coefficients as low as 
.71 are reported. Thus, for younger subjects, RPM 
may not possess sufficient reliability to warrant its 
use for individual decision making. 

Factor-analytic studies of the RPM provide lit- 
tle, if any, support for the original intention of the 
test to measure a unitary construct (Spearman’s g 
factor). Several studies of the Coloured Progressive
Matrices reveal three orthogonal factors (Carlson 
& Jensen, 1980; Wiedl & Carlson, 1976). Factor I 
consists largely of very difficult items and might be 
termed closure and abstract reasoning by analogy. 
Factor II is labeled pattern completion through 
identity and closure. Factor III consists of the eas- 
iest items and is defined as simple pattern comple- 
tion (Carlson & Jensen, 1980). In sum, the very 
easy and the very hard items on the Coloured Pro- 
gressive Matrices appear to tap different intellec- 
tual processes. 

The Advanced Progressive Matrices breaks 
down into two factors that may have separate pre- 
dictive validities (Dillon, Pohlmann, & Lohman, 
1981). The first factor is composed of items in 
which the solution is obtained by adding or sub- 
tracting patterns (Figure 6.10a). Individuals per- 
forming well on these items may excel in rapid 
decision making and in situations where part-whole 
relationships must be perceived. The second factor 
is composed of items in which the solution is based 
on the ability to perceive the progression of a pat- 
tern (Figure 6.10b). Persons who perform well on 
these items may possess good mechanical ability as 
well as good skills for estimating projected move- 
ment and performing mental rotations. However, the 
skills represented by each factor are conjectural at 
this point and in need of independent confirmation. 

A huge body of published research bears on the 
validity of the RPM. The early data are well summa-
rized by Burke (1958), while more recent findings
are compiled in the current RPM manuals (Raven & 
Summers, 1986; Raven, Court, & Raven, 1983, 
1986, 1992). In general, validity coefficients with 
achievement tests range from the .30s to the .60s. 
As might be expected, these values are somewhat 
lower than found with more traditional (verbally 
loaded) intelligence tests. Validity coefficients with 
other intelligence tests range from the .50s to the 
.80s. Also, as might be expected, the correlations 
tend to be higher with performance than with ver- 
bal tests. In a massive study involving thousands of 
schoolchildren, Saccuzzo and Johnson (1995) concluded
that the Standard Progressive Matrices and
the WISC-R showed approximately equal predictive
validity and no evidence of differential validity
across eight different ethnic groups. In a lengthy review,
Raven (2000) discusses stability and variation
in the norms for the Raven’s Progressive Matrices
across cultural, ethnic, and socioeconomic groups
over the last 60 years. Indicative of the continuing
interest in this venerable instrument, Costenbader
and Ngari (2001) describe the standardization of the
Coloured Progressive Matrices in Kenya.

FIGURE 6.10 Raven’s Progressive Matrices:
Typical Items

Johnson, Saccuzzo, and Guertin (1994) accomplished
the (nearly) impossible by developing a
truly comparable alternate form of the 60-item
Standard Progressive Matrices. For each of the 
original 60 items, they developed a similar item that 
was comparable in terms of difficulty level and un- 
derlying cognitive strategy required for solution. 
An alternate-forms reliability analysis on a diverse 
group of 449 children who took both tests in coun- 
terbalanced order revealed a reliability coefficient 
of .90, which is on a par with immediate test-retest 
data. In this same sample, the distribution of scores 
showed no differences for standard deviation, 
skewness, and rank order of item difficulties. The 
mean number correct was 36.1 on the SPM and 
35.5 on the new test. In sum, the two versions of the
test are nearly identical in overall psychometric 
characteristics and also in difficulty level. The new 
test promises to serve an important role in research 
studies that require retesting. 
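
An alternate-forms reliability coefficient of the kind reported above is the Pearson correlation between scores on the two forms. The following is a minimal sketch in Python, assuming invented number-correct scores for six examinees (these are not data from Johnson et al., 1994); the function name `pearson_r` is our own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical number-correct scores on the two forms (illustration only).
form_a = [30, 42, 25, 51, 36, 44]
form_b = [28, 40, 27, 50, 35, 45]

reliability = pearson_r(form_a, form_b)
print(round(reliability, 2))
```

In an actual study the two forms would be given in counterbalanced order, as described above, so that practice effects do not favor one form.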


Comment on the RPM 


Even though the RPM has not lived up to its origi- 
nal intentions of measuring Spearman’s g factor, 
the test is nonetheless a useful index of nonverbal, 
figural reasoning. The recent updating of norms 
was a much-welcomed development for this well-known
test, in that many American users were leery
of the outdated and limited British norms. Nonetheless,
adult norms for the Standard and Advanced
Progressive Matrices are still quite limited. 

The RPM is particularly valuable for the sup- 
plemental testing of children and adults with hear- 
ing, language, or physical disabilities. Often, these 
examinees are difficult to assess with traditional 
measures that require auditory attention, verbal ex- 
pression, or physical manipulation. In contrast, the 
RPM can be explained through pantomime, if nec- 
essary. Moreover, the only output required of the 
examinee is a pencil mark or gesture denoting the 
chosen alternative. For these reasons, the RPM is 
ideally suited for testing persons with limited com- 
mand of the English language. In fact, the RPM is 
about as culturally reduced as possible: The test 
protocol does not contain a single word in any language.
Mills and Tissot (1995) found that the Advanced
Progressive Matrices identified a higher
proportion of minority children as gifted than did a 
more traditional measure of academic aptitude (the 
School and College Ability Test). 

A final note of caution: Some very bright and 
high-functioning persons perform abysmally on the 
RPM. Gregory and Gernert (1990) tested nearly 100 
university faculty members with a variant of the 
RPM. One participant, an accomplished researcher 
who had risen to a vice presidential level, hadn’t the 
slightest clue how to solve the RPM problems and 
scored at a chance level. Some persons of above- 
average intelligence simply do not perform well on 
figural-reasoning tasks. Examiners would be well 
advised to question the validity of a low score ob- 
tained by an otherwise accomplished individual. 


PERSPECTIVE ON 
CULTURE-FAIR TESTS 


Cattell’s Culture-Fair Intelligence Test (CFIT) and 
Raven’s Progressive Matrices (RPM) are often 
cited as examples of culture-fair tests, a concept 
with a long and confused history. We will attempt 
to clarify terms and issues here. 

The first point to make is that intelligence tests 
are merely samples of what people know and can 
do. We must not reify intelligence and overvalue in- 
telligence tests. Tests are never samples of innate 
intelligence or culture-free knowledge. All knowl- 
edge is based in culture and acquired over time. As 
Scarr (1994) notes, there is no such thing as a 
culture-free test. 

But what about a culture-fair test, one that poses 
problems that are equally familiar (or unfamiliar) to 
all cultures? This would appear to be a more realis- 
tic possibility than a culture-free test, but even here 
the skeptic can raise objections. Consider the ques- 
tion of what a test means, which differs from cul- 
ture to culture. In theory, a test of matrices would 
appear to be equally fair to most cultures. But in 
practice, issues of equity arise. Persons reared in 
Western cultures are trained in linear, convergent 
thinking. We know that the purpose of a test is to 
find the single, best answer and to do so quickly. We
examine the 3 × 3 matrix from left to right and top
to bottom, looking for the logical principles invoked 
in the succession of forms. Can we assume that 
persons reared in Nepal or New Guinea or even the 
remote, rural stretches of Idaho will do the same? 
The test may mean something different to them. Per- 
haps they will approach it as a measure of aesthetic 
progression rather than logical succession. Perhaps
they will regard it as so much silliness not worthy
of intense intellectual effort. To assume that a test is 
equally fair to all cultural groups merely because the 
stimuli are equally familiar (or unfamiliar) is in- 
appropriate. We can talk about degrees of cultural 
fairness (or unfairness), but the notion that any test 
is absolutely culture-fair surely is mistaken. 


SUMMARY 


1. Group intelligence tests differ from indi- 
vidual tests in five ways: multiple-choice versus 
open-ended format, objective machine scoring ver- 
sus examiner scoring, group versus individualized 
administration, applications in screening versus re- 
medial planning, and huge versus merely large 
standardization samples. 


2. The obvious advantage of group testing is 
that large numbers of examinees can be tested 
quickly and efficiently. The disadvantage of group 
testing is that examinees may score far below their 
true ability because of motivational problems or dif- 
ficulty following directions. 


3. The Multidimensional Aptitude Battery 
(MAB) is a multiple-choice group intelligence test 
designed to be a paper-and-pencil equivalent of the 
WAIS-R. MAB and WAIS-R IQs correlate .82, .65, 
and .91, for Verbal, Performance, and Full Scale IQ, 
respectively. Test-retest reliability of the instrument 
is excellent, and factor analyses support its construct 
validity. 

4. The Shipley Institute of Living Scale 
(SILS) was originally proposed as an index of in- 
tellectual deterioration. The SILS consists of a 40- 
item multiple-choice vocabulary section and a 
20-item fill-in-the-blank abstractions section. The 
test has not functioned well as an index of organic- 
ity, but does meet a need as a brief screening device 
for general intelligence. 

5. The Cognitive Abilities Test (CogAT) is 
representative of the many multilevel, school-based 
test batteries in current use. The nine subtests of the
CogAT are grouped into a Verbal Battery, a Quantitative
Battery, and a Nonverbal Battery. The test is co-normed
with two achievement tests, the Iowa Test of Basic 
Skills and the Tests of Educational Development. 


6. The Culture Fair Intelligence Test (CFIT) is 
a nonverbal measure of fluid intelligence that at- 
tempts to minimize cultural bias. The CFIT is suited 
for ages four through adult and comes in three ver- 
sions, each consisting of two equivalent forms. Each 
form consists of four subtests: Series, Classification, 
Matrices, and Conditions. 


7. The reliability of the CFIT is superb and 
scores correlate very strongly with other respected 
tests of intelligence. The CFIT is a good test of in- 
telligence, but is probably as culturally bound as 
most traditional tests. The test needs revision and 
restandardization. 


8. Originally designed as a measure of Spear- 
man’s g factor, Raven’s Progressive Matrices (RPM) 
is a nonverbal test of inductive reasoning based on 
figural stimuli that comes in three different versions: 
Coloured Progressive Matrices (ages 5 to 11), Stan- 
dard Progressive Matrices (ages 6 through adult), 
and Advanced Progressive Matrices. 


9. Although the RPM is a reliable and valid 
index of intelligence, there is little support for the 
test as a unitary measure of the g factor. Factor 
analyses usually reveal two or three factors, includ- 
ing reasoning by analogy and simple pattern com- 
pletion. The RPM is useful for the supplemental 
testing of persons with hearing, language, or phys- 
ical disabilities. 


10. Culture-fair testing is an idealized abstrac- 
tion that is never achieved in the real world. Even 
the meaning of a test may differ among cultural 
groups, which will affect the validity of comparisons.
Some tests are more culture-fair than others,
but it is not possible for any test to be equally fair to 
all cultural groups. 


KEY TERMS AND CONCEPTS 


index of intellectual deterioration
culture-fair test


The individual and group intelligence tests reviewed
in previous chapters are suitable for
persons with normal or near-normal capacities in 
speech, hearing, vision, movement, and general in- 
tellectual ability. However, not every examinee falls 
within the ordinary spectrum of physical and men- 
tal abilities. By reason of youthful age, physical 
disability, diminished intellect, or language disad- 
vantage, a large proportion of the population falls 
outside the reach of traditional tests and proce- 
dures. According to the U.S. Census Bureau, about 
25 million Americans (one in ten) have a severe
disability that prevents them from performing one
or more activities or roles (www.census.gov, 1998). 
This estimate does not include persons living in in- 
stitutions. In these special cases, novel tests are 
needed for valid assessment. In Topic 7A, Testing 
Special Populations, we discuss instruments de- 
signed for exceptional and difficult consultations, 
such as persons with sensory/motor impairment, re- 
cent immigrants from non-English-speaking coun- 
tries, and individuals with significant intellectual 
deficiencies. In Topic 7B, Test Bias and Other Controversies,
we continue a circumspect theme by
raising a number of concerns about the use and
meaning of intelligence test scores. 

ORIGINS OF TESTS FOR
SPECIAL POPULATIONS


Beginning in the 1950s, a renewed commitment to 
the needs and rights of physically and mentally dis- 
abled persons arose in the United States (Maloney 
& Ward, 1979; Patton, Payne, & Beirne-Smith, 
1986). Societal attitudes toward those with special 
needs shifted from outright disdain to a more supportive
stance that favored new programs and initiatives
on behalf of the disabled. Progress has been
slow, but we are no longer surprised to see bath- 
room facilities with wheelchair access for persons 
with physical disability, large-print books for per- 
sons with visual impairments, or closed-captioned 
television programs for persons with hearing dis- 
abilities. Furthermore, the special needs of citizens 
with mental retardation are increasingly served by 
small community care facilities instead of massive, 
impersonal institutions. 

In the early 1970s, the renewed concern for the 
needs of disabled persons was translated into fed- 
eral legislation. In 1973, Public Law 93-112 was 
passed, serving as a “Bill of Rights” for disabled in- 
dividuals. This legislation outlawed discrimination 
on the basis of disability. Two years later, the land- 
mark Education for All Handicapped Children Act 
(Public Law 94-142) was enacted. This legislation 
mandated that disabled schoolchildren receive ap- 
propriate assessment and educational opportunities. 
In particular, psychologists were directed to assess 
children in all areas of possible disability—mental, 
behavioral, and physical—and to use instruments 
validated for those express purposes. 

In this topic, we examine tests that can be used 
for the assessment of persons with sensory, motor, or 
mental disabilities. However, before discussing spe- 
cific tests, we review certain distinctions between 
the types of tests that are available for exceptional 
assessments. The reader also will appreciate a brief 
summary of the legal mandates that have shaped assessment
practices with disabled individuals.


Approaches to Assessment
of Special Populations


Special tests were first devised in the early 1900s 
to test non-English-speaking immigrants, people 
who are deaf, and persons with speech defects 
(DuBois, 1970). These early special instruments 
were largely performance or nonlanguage tests that 
could be administered by pantomime. The exami- 
nee manipulated objects or used paper and pencil 
to complete easy-to-understand tasks such as trac- 
ing a path through a maze. 

Special instruments also have been devised for 
nonreading examinees who possess some ability to 
understand spoken English. These nonreading tests 
are intended for young children and other illiterate 
persons who nonetheless can comprehend and follow
oral instructions. Many nonreading tests involve
the manipulation of objects. However, a
nonreading test also can assess language compre- 
hension skills by using a picture vocabulary format: 
The examiner says a word and the examinee points 
to the one picture from an array of pictures that de- 
picts the word. Several picture vocabulary tests are 
discussed subsequently. 

A motor-reduced test requires the barest mini- 
mum of motor output for a response. In a motor-re- 
duced test, the examinee merely points or gestures 
to the correct answer from among several alterna- 
tives. For example, an examinee with cerebral palsy 
might respond to picture vocabulary items by plac- 
ing a hand over the chosen alternative. Some non- 
reading tests—particularly those that use a picture 
vocabulary format—are also motor-reduced tests. 

Finally, we should mention that several impor- 
tant assessment devices are not really tests at all. A 
developmental schedule is a standardized device 
for observing and evaluating the behavioral devel- 
opment of infants and young children. These in- 
struments usually inquire into major developmental 
milestones such as sitting alone, standing unaided, 
and so forth. It is characteristic of such tools that 
the “examinee” doesn’t take a test per se or, for 
that matter, do anything out of the ordinary. A de- 
velopmental schedule is really just a structured 
form of observation. Likewise, a behavior scale is
an instrument for determining the profile of behavioral
skills (and perhaps excesses) exhibited by
a child or adult with mental retardation. Behavior 
scales are usually filled out by a knowledgeable 
adult (parent, teacher, or psychologist). 


THE LEGAL MANDATE FOR ASSESSING 
PERSONS WITH DISABILITIES 


Many practices in the assessment of persons with 
disabilities are the direct result of legislation and 
court cases. As background to the discussion of 
specific tests and procedures, we offer a quick re- 
view of public laws relevant to the assessment of 
persons with disabilities. The coverage is purpose- 
fully brief. Readers can find lengthier discussions 
in Bruyere and O’Keeffe (1994), Salvia and Ys- 
seldyke (2001), and Stefan (2001). 


Public Law 94-142 


In 1975, the U.S. Congress passed a compulsory 
special education law, Public Law 94-142, known 
as the Education for All Handicapped Children 
Act.¹ According to Ballard and Zettel (1977), this
law was designed to meet four major goals: 


1. To ensure that special education services are 
available to children who need them 

2. To guarantee that decisions about services to 
disabled students are fair and appropriate 

3. To establish specific management and auditing 
requirements for special education 

4. To provide federal funds to help the states edu- 
cate disabled students 


Many practices in the assessment of disabled
persons stem directly from the provisions of Public
Law 94-142. For example, the law specifies that
each disabled student must receive an individualized
education plan (IEP) based on a comprehensive
assessment by a multidisciplinary team. The
IEP must outline long-term and short-term objectives
and specify plans for achieving them. In addition,
the IEP must indicate how progress toward
these objectives will be evaluated. The parents are
intimately involved in this process and must approve
the particulars of the IEP.

1. Each congressional law receives two numbers, one referring
to the particular Congress that passed it, the other referring to
the law itself. Thus, Public Law 94-142 is the 142nd law passed
by the 94th Congress.

Pertinent to testing practices, PL 94-142 
includes a number of provisions designed to en- 
sure that assessment procedures and activities are 
fair, equitable, and nondiscriminatory. Salvia and 
Ysseldyke (1988) summarize these provisions as 
follows: 


1. Tests are to be selected and administered in such 
a way as to be racially and culturally nondis- 
criminatory. 

2. To the extent feasible, students are to be as- 
sessed in their native language or primary mode 
of communication. 

3. Tests must have been validated for the specific 
purpose for which they are used. 

4. Tests must be administered by trained personnel 
in conformance with the instructions provided 
by the test producer. 

5. Tests used with students must include those 
designed to provide information about specific 
educational needs, and not just a general intelli- 
gence quotient. 

6. Decisions about students are to be based on 
more than performance on a single test. 

7. Evaluations are to be made by a multidiscipli- 
nary team that includes at least one teacher or 
other specialist with knowledge in the area of 
suspected disability. 

8. Children must be assessed in all areas related 
to a specific disability, including—when appro- 
priate—health, vision, hearing, social and 
emotional status, general intelligence, academic 
performance, communicative skills, and motor 
skills. 


PL 94-142 also contains a provision that dis- 
abled students should be placed in the least restric- 
tive environment—one that allows the maximum 
possible opportunity to interact with nonimpaired 
students. Separate schooling is to occur only when 
the nature or the severity of the disability is such
that instructional goals cannot be achieved in the
regular classroom. Finally, the law contains a due 
process clause that guarantees an impartial hearing 
to resolve conflicts between the parents of disabled 
children and the school system. 

In general, the provisions of PL 94-142 have 
provided strong impetus to the development of spe- 
cialized tests that are designed, normed, and vali- 
dated for children with specific disabilities. For 
example, in the assessment of a child with visual 
impairment, the provisions of PL 94-142 virtually 
dictate that the examiner must use a well-normed 
test devised just for this population rather than re- 
lying upon traditional instruments. 


Public Law 99-457 


In 1986, Congress passed several amendments to 
the Education for All Handicapped Children Act, 
expanding the provisions of PL 94-142 to include 
disabled preschool children. Public Law 99-457 
requires states to provide free appropriate public 
education to disabled children ages 3 through 5. 
The law also mandates financial grants to states that 
offer interdisciplinary educational services to dis- 
abled infants, toddlers, and their families, thus es- 
tablishing a huge incentive for states to serve 
children with disabilities from birth through age 2. 
Public Law 99-457 also provides a major impetus 
to the development and validation of infant tests 
and developmental schedules. After all, the early 
and accurate identification of at-risk children 
would appear to be the crucial first step in effective 
interdisciplinary intervention. 


Americans with Disabilities Act 


The 1990 Americans with Disabilities Act (ADA) 
forbids discrimination against qualified individuals 
with disabilities in both the public sector (e.g., gov- 
ernment agencies and entities receiving federal 
grants) and the private sector (e.g., corporations 
and other for-profit employers). Under the ADA, 
disability is defined as a physical or mental im- 
pairment that substantially limits one or more of the 
major life activities (Parry, 1997). Examples of
ADA-recognized disabilities include sensory and
physical impairments (e.g., blindness, paralysis), 
many mental illnesses (e.g., major depression, 
schizophrenia), learning disabilities, and attention- 
deficit/hyperactivity disorder. 

Under the ADA, the process of qualifying an in- 
dividual for work or educational accommodations 
requires current, detailed, and professional docu- 
mentation. For example, a graduate student who 
was seeking a special arrangement for taking tests 
(such as a quiet room) because of attentional prob- 
lems might need to submit a comprehensive en- 
dorsement from a licensed psychologist, detailing 
the history, current functioning, clinical diagnosis 
of attention-deficit/hyperactivity disorder, and ne- 
cessity for accommodations (Gordon & Keiser, 
1998). In other words, the ADA is a civil rights act, 
not a program of entitlement: 

The ADA does not guarantee equal outcomes,
establish quotas, or require preferences favoring
individuals with disabilities. Rather, the ADA is
intended to ensure access to equal employment
opportunities based on merit. The ADA is designed
to “level the playing field” by removing the barriers
that prevent qualified individuals with disabilities
from having access to the same employment
opportunities that are available to individuals without
disabilities. (Klimoski & Palmer, 1994, p. 45)


In sum, the purpose is to ensure that individuals 
who are otherwise qualified for jobs or educational 
programs are not denied access or put at improper 
disadvantage simply because of a disability. 

In regard to psychological testing, an important 
provision of the ADA is that agencies and institu- 
tions must make reasonable testing accommoda- 
tions for persons with disabilities. With appropriate 
documentation (discussed earlier), the relevant ac- 
commodations might include any of the following: 


• Assistance in completing answer sheets
• Audiotape or oral presentation of written tests
• Special seating for tests
• Large-print examinations
• Retaking exams
• Dictating rather than writing test answers
• Printed version of verbal instructions
• Extended time limit

In general, changes in the testing medium (e.g.,
from written to oral) are consistent with the inten- 
tion of ADA, if such a change is needed to accom- 
modate a disability. For example, an appropriate 
accommodation in the testing medium would be the 
audiotaped presentation of test items for persons
who are visually impaired. On the other hand, 
changing a test from a printed version into a sign 
language version for persons with hearing impair- 
ment would be considered translation into another 
language, not a simple change of medium. 

In most testing accommodations mandated by 
the ADA, it is necessary to change the time limits, 
usually by providing extra time. This raises prob- 
lems of test interpretation, especially when a strict 
time limit is essential to the validity of a test. For 
example, Willingham, Ragosta, Bennett, and oth- 
ers (1988) found that extended time limits on the 
SAT significantly reduced the validity of the test as 
a predictor of first-year college grades. This was es- 
pecially true for examinees with learning disabili- 
ties, whose first-year grades were subsequently 
overpredicted by their SAT scores. Thus, although 
it seems fair to provide extra time on a test when the 
testing medium has been changed (e.g., audiotaped 
questions replacing the printed versions), from a 
psychometric standpoint, the challenge is to deter- 
mine how much extra time should be provided so 
that the modified test is comparable to the original 
version. Nester (1994) and Phillips (1994) provide 
thoughtful perspectives on the range of reasonable 
accommodations required by the ADA. 
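
Overprediction, as the term is used above, means that the grades examinees actually earn fall below the grades predicted from their test scores. A minimal sketch of how this might be checked, assuming a hypothetical prediction equation and invented data (none of these numbers come from Willingham et al., 1988; the function names are our own):

```python
def predicted_gpa(sat, intercept=0.5, slope=0.002):
    """Hypothetical linear prediction of first-year GPA from an SAT total score."""
    return intercept + slope * sat

def mean_prediction_error(sat_scores, actual_gpas):
    """Mean of (actual - predicted); a negative value indicates overprediction."""
    errors = [gpa - predicted_gpa(sat) for sat, gpa in zip(sat_scores, actual_gpas)]
    return sum(errors) / len(errors)

# Invented scores earned under extended time, paired with grades later obtained.
sat_extended = [1000, 1100, 1200]
gpa_actual = [2.2, 2.5, 2.6]

error = mean_prediction_error(sat_extended, gpa_actual)
print(round(error, 2))  # negative here, so these grades were overpredicted
```

In practice the prediction equation would be estimated from examinees tested under standard conditions, and the size of the error would be compared across groups.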

Now that we have summarized the legal back- 
ground to the assessment of persons with special 
needs, we turn to a review of typical instruments 
used for the testing of individuals with disabilities. 
We organize the review around the following 
topics: nonlanguage tests, nonreading and motor- 
reduced tests, tests for persons with visual impair- 
ment, and the assessment of adaptive behavior in 
those with mental retardation. 


NONLANGUAGE TESTS


As the reader will recall, nonlanguage tests require 
little or no written or spoken language from examiner
or examinee. Thus, they are particularly suited
for assessment of non-English-speaking persons, 
referrals with speech impairments, and examinees 
with weak language skills. These instruments can 
also be used as supplementary tests for examinees 
who have no disabilities. 


Leiter International Performance 
Scale-Revised 


The Leiter International Performance Scale-Re- 
vised (LIPS-R, Roid & Miller, 1997) is a recent re- 
vision of a classic and highly praised test of 
nonverbal intelligence and cognitive abilities 
(Leiter, 1948, 1979). Leiter devised an experimen- 
tal edition of the test in 1929 to assess the intelli- 
gence of those with hearing or speech impairment, 
those who were bilingual, or non-English-speaking 
examinees. The scale was field-tested with several 
ethnic groups in Hawaii, including children of 
Japanese and Chinese descent. The first edition was 
based upon test results for American children, high- 
school students, and WWII Army recruits. Although 
highly praised and widely used after its initial re- 
lease, this test received strong criticism in recent 
years because of poor illustrations and outdated 
norms. The revised Leiter answers all criticisms 
handily, and the LIPS-R deserves wide use as a culture-reduced
measure of nonverbal intelligence.

A remarkable feature of the Leiter is the 
complete elimination of verbal instructions. The 
Leiter-R does not require a single spoken word 
from the examiner or the examinee. With an age 
range of 2 years to 20 years and 11 months, the 
Leiter-R is particularly suitable for children and 
adolescents whose English language skills are 
weak. This includes children with any of these 
features: non-English-speaking, autism, traumatic 
brain injury, speech impairment, hearing problems, 
or an impoverished environment. The test is also 
useful in the assessment of attentional problems, as 
described in the following. 

Testing is performed by the child or adolescent
matching small laminated cards underneath corresponding
illustrations on an easel display (Figure
7.1). The test is untimed. Because the initial items
are transparently obvious, most examinees catch on
quickly without need of pantomime demonstration.
The Leiter-R contains 20 subtests organized into
four domains: Reasoning, Visualization, Memory,
and Attention. Not all subtests are administered to
every child. For example, the figure rotation subtest
is too difficult for 2-year-olds and the immediate
recognition subtest is too easy for adolescent
examinees. The four Reasoning subtests include
classification and design analogies. The six Visualization
subtests include matching, figure-ground,
paper folding, and figure rotation. The eight Memory
subtests include memory span, spatial memory,
associative memory, and delayed recognition memory.
The two Attention subtests consist of an underlining
test (e.g., marking all squares printed on
a page full of geometric shapes) and a measure of
divided attention (e.g., observing a moving display
and simultaneously sorting cards correctly).

FIGURE 7.1 A Characteristic Item from the Leiter
International Performance Scale-Revised

The Leiter-R yields a composite IQ with the fa- 
miliar mean of 100 and standard deviation of 15. 
The test also produces subtest scaled scores with a 
mean of 10 and standard deviation of 3, as well as 
a variety of composite scores useful in clinical di- 
agnosis. The test was normed on over 2,000 chil- 
dren and adolescents, from 2 to 21 years of age. 
Using 1993 census statistics, these subjects were 
carefully stratified according to race, age, gender,
social class, and geographic region. Internal consistency
reliability for subtests, domain scores, and
IQ scores is excellent. Typical coefficient alphas 
are in the high .80s for subtests and the low .90s for 
domain scores and IQ scores. Extensive studies of 
item bias reveal that the items appear to function 
similarly in separate racial groups (white, African 
American, and Hispanic samples); that is, there is 
no evidence of bias (defined as differential item 
functioning). Coupled with the fact that the test is 
completely nonverbal, the absence of test bias in- 
dicates that the Leiter-R is a good choice for cul- 
ture-reduced testing of minority children. 
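
The composite IQ and subtest scaled-score metrics described above are both linear rescalings of a z score. As a rough illustration only (the function name and the example z value are our own assumptions; actual Leiter-R scores are read from norm tables, not computed from a formula), the relationship can be sketched in Python as:

```python
def standard_score(z, mean, sd):
    """Rescale a z score to a metric with the given mean and standard deviation."""
    return mean + sd * z

# Hypothetical examinee one standard deviation above the norm-group mean.
z = 1.0
composite_iq = standard_score(z, mean=100, sd=15)    # IQ metric
subtest_scaled = standard_score(z, mean=10, sd=3)    # subtest scaled-score metric
print(composite_iq, subtest_scaled)
```

The same relative standing therefore appears as different numbers on the two metrics, which is why the mean and standard deviation of a scale must be known before a score can be interpreted.
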

Empirical research with the Leiter-R is scant
at this time. The test has been shown to have utility 
in the assessment of medically fragile children 
(Hooper, Hatton, Baranek, Roberts, & Bailey, 
2000) and the evaluation of children classified as 
language impaired (Farrell & Phelps, 2000). In 
this latter study, the Leiter-R also demonstrated a 
validity-confirming correlation of r = .80 with 
another nonverbal measure of intelligence. Studies 
with the first edition indicate strong relationships 
with other intelligence test scores. For example, the 
Leiter and the WISC Performance IQ correlated 
near the .80s; correlations with the WISC Verbal IQ 
are more typically in the .60s (Arthur, 1950; Matey, 
1984). Reeve, French, and Hunter (1983) compared 
the Leiter and the Stanford-Binet: Form L-M as 
predictors of Metropolitan Achievement Test 
scores for 60 kindergartners. Correlations were .77 
between Stanford-Binet and MAT total, and .61 be- 
tween Leiter and MAT total. The authors note that 
although the Stanford-Binet proved to be a mar- 
ginally better predictor of standard achievement, 
children with hearing and/or speech problems may 
require the Leiter or other nonverbal instruments. 

The Leiter-R is a welcome revision of an obsolete
test. In the hands of a careful clinician, the test
is helpful in the intellectual assessment of children 
with weak skills in English. Other uses for the re- 
vised test include the assessment of attention- 
deficit/hyperactivity disorder (comparisons of the 
Attention subtests with the other domains are cru- 
cial here) and the evaluation of giftedness in young 
children (the extremely high ceiling of the test
proves invaluable for this application). Whereas reviewers
warned against using the original Leiter for
placement or decision-making purposes (Sattler,
1988; Salvia & Ysseldyke, 1991), the revised Leiter
is a huge improvement in regard to psychometric
quality and standardization excellence. Thorough 
reviews of the Leiter-R and other nonverbal as- 
sessment instruments are provided by Athanasiou 
(2000) and McCallum, Bracken, and Wasserman 
(2001). 


Human Figure Drawing Tests 


Most children enjoy drawing human figures and do 
so routinely and spontaneously. Since the early 
1900s, psychologists have tried to tap into this al- 
most instinctive behavior as a basis for measuring 
intellectual development. The first person to use 
human figure drawing (HFD) as a standardized in- 
telligence test was Florence Goodenough (1926). 
Her test, known as the Draw-A-Man test, was 
revised by Harris (1963) and renamed the Good- 
enough-Harris Drawing Test. More recently, the 
HFD technique has been adapted by Naglieri 
(1988). An additional approach by Gonzales (1986) 
is not reviewed here. We should also mention that 
human figure drawings are widely used as measures 
of emotional adjustment, but we do not discuss that 
application here. 

The Goodenough-Harris Drawing Test is a 
brief, nonverbal test of intelligence that can be ad- 
ministered individually or in a group. Goodenough 
(1926) published the first edition of this test, while 
Harris (1963) provided important refinements in 
scoring and standardization, including the use of a 
deviation IQ. Strictly speaking, the Goodenough- 
Harris test doesn’t fit the criteria for nonlanguage 
tests insofar as the examiner must convey certain 
instructions in English or through a translator. 
However, the instructions are brief and basic (“I 
want you to draw a picture of a man [or woman]; 
make the very best picture you can”). The Good- 
enough-Harris test is, for all practical purposes, a 
nonlanguage test. 

The purpose of the Goodenough-Harris Drawing
Test is to measure intellectual maturity, not
artistic skill. Thus, the scoring guide emphasizes
accuracy of observation and the development of 
conceptual thinking. The child receives credit for 
including body parts and details, as well as for pro- 
viding perspective, realistic proportion, and im- 
plied freedom of movement. 

The 73 scorable items were selected according 
to the following criteria: 


1. The items should show a regular and fairly rapid 
increase with age, in the percentage of children 
passing the point. 

2. The items should show a relationship to some 
general measure of intelligence. 

3. The items should differentiate between children 
scoring high on the scale as a whole and those 
scoring low on the scale as a whole (Harris, 
1963). 


A sample drawing with item-by-item scoring is de- 
picted in Figure 7.2. 
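
The third criterion can be illustrated with a simple upper-lower discrimination index. The sketch below is an illustration only and is not Harris's actual procedure: the function name, the median split, and the scores for eight children are all hypothetical.

```python
def discrimination_index(item_passed, total_scores):
    """Pass rate in the upper half minus pass rate in the lower half.

    item_passed  -- list of 0/1 flags, one per child, for a single item
    total_scores -- total scale scores for the same children
    The median split is an assumption made for this sketch.
    """
    # Order the children from lowest to highest total score.
    ranked = [p for _, p in sorted(zip(total_scores, item_passed))]
    half = len(ranked) // 2
    lower, upper = ranked[:half], ranked[-half:]
    return sum(upper) / len(upper) - sum(lower) / len(lower)


# Hypothetical data for eight children (not taken from Harris, 1963).
totals = [10, 12, 15, 18, 22, 25, 30, 34]
passed = [0, 0, 1, 0, 1, 1, 1, 1]
d_index = discrimination_index(passed, totals)
print(d_index)  # prints 0.75
```

An index near zero would mean that high and low scorers pass the item about equally often, so the item would fail the third criterion.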

In addition to the Man scale, the Harris (1963) 
revision also includes two additional forms: the 
Woman scale and the Self scale. For these last two 
scales, examinees are instructed to draw a picture 
of a woman and of themselves. Scores on the Man 
and Woman scales are very highly correlated for 
examinees of either sex (r = .91 to .98). These two 
versions can be considered equivalent forms. The 
Self scale was intended as a projective test of self- 
concept. However, self-concept is a fuzzy construct 
that is difficult to objectify. The Self scale has 
largely fallen by the wayside, although some psy- 
chologists use it purely as an unscored extension of 
the clinical interview. 

The standardization sample for the Good- 
enough-Harris Drawing Test was large (N = 2,975 
children), geographically varied (from urban and 
rural areas throughout the United States), and care- 
fully selected to match U.S. population values for 
parental occupational status. The test covers ages 3 
to 16, but the norms are best for ages 5 to 12. Be- 
yond age 12, examinees begin to approach an as- 
ymptote of performance and age differences are 
reduced. The Man scale yields a deviation IQ-like 
standard score with mean of 100 and standard 





Draw a picture of a man. Make the very best picture
you can. Be sure to make the whole man, not just his 
head and shoulders. 





Note: This effort by an eight-year-old girl converts to a Standard 
Score of 118 and a percentile rank of 88. 





FIGURE 7.2 Goodenough-Harris Drawing Test with 
Item-by-Item Scoring 

Source: Goodenough-Harris Drawing Test. Copyright © 1963 by 
Harcourt Educational Measurement, a Harcourt Assessment Com- 
pany. Reprinted by permission. All rights reserved. 
deviation of 15. One concern is simply that Drawing
Test norms are now quite dated. Abell,
Horkheimer, and Nguyen (1998) found that the 
scoring system for this test consistently underesti- 
mated IQ scores on the WISC-R. 
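
The note to Figure 7.2 pairs a standard score of 118 with a percentile rank of 88. The sketch below shows how such a pairing follows from a mean of 100 and a standard deviation of 15, assuming the standard scores are normally distributed. Published norm tables need not follow the normal curve exactly, and the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_rank(score, mean=100.0, sd=15.0):
    """Percent of a normal distribution that falls below the score."""
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(score)


rank_118 = percentile_rank(118)
print(round(rank_118))  # prints 88
```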

The reliability of the test has been assessed by 
split-half procedures, test-retest studies, and inter- 
scorer comparisons (Anastasi, 1975; Frederickson, 
1985; Harris, 1963). Split-half reliabilities near .90 
are common. However, stability coefficients sel- 
dom exceed the .70s, even when the test-retest in- 
terval is only a few weeks. This suggests that scores 
on the Goodenough-Harris Drawing Test possess a 
sizeable band of measurement error. On the other 
hand, scoring is quite objective: Interscorer corre- 
lations are typically in the .90s. 
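
The size of this band can be estimated with the standard error of measurement from classical test theory, SEM = SD x sqrt(1 - r). The sketch below is an illustration under that assumption. It uses the standard deviation of 15 and round reliability values of .70 and .90 taken from the ranges described above, and the function name is hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SD times sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


sem_retest = standard_error_of_measurement(15, 0.70)
sem_split_half = standard_error_of_measurement(15, 0.90)
print(round(sem_retest, 1))      # prints 8.2
print(round(sem_split_half, 1))  # prints 4.7
```

With a stability coefficient of .70, one standard error spans roughly 8 standard score points in either direction, compared with about 5 points when reliability is .90.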

Examiners who have mastered the elaborate 
point scoring system may then use a simpler global 
method called the Quality Scale. The Quality Scale 
consists of 24 drawings (12 for the Man scale and 
12 for the Woman scale) used as standardized ref- 
erence points. The examiner matches the exami- 
nee’s drawing to one of the 12 reference drawings, 
then consults a table to determine the correspond- 
ing standard score. The Quality score is quicker, but 
slightly cruder: Interscorer reliabilities are typi- 
cally in the low .80s. 

The Goodenough-Harris test is often used as a 
nonverbal measure of cognitive ability with chil- 
dren who have language disabilities and minority 
or bilingual children. Oakland and Dowling (1983) 
view the Drawing Test as a culturally reduced test 
that is appropriate for initial screening of minority 
children. The test works best with younger chil- 
dren, particularly those with lower intellectual 
ability (Scott, 1981). For samples of 5-year-old 
children at a day care center for lower socioeco- 
nomic families, Frederickson (1985) reported cor- 
relations between Goodenough-Harris Drawing 
Test scores and WPPSI Full Scale IQ in the range 
of .72 to .80. In several other studies, correlations 
with individual IQ tests are more variable, but the 
majority are over .50 (Abell, Briesen, & Watz, 
1996; Anastasi, 1975). 

In response to criticisms of the Goodenough- 
Harris Drawing Test, Naglieri (1988) developed a 
quantitative scoring system and renormed the human
figure drawing procedure. His scoring system, The
Draw A Person: A Quantitative Scoring System
(DAP), was normed on a sample of 2,622 individuals
ages 5 through 17 years who were representative
of the 1980 U.S. Census data on age, sex, race,
geographic region, ethnic group, social class, and
community size. The DAP yields standard scores with
the familiar mean of 100 and standard deviation of
15. In a study of 61 subjects ages 6 to 16 years, the
DAP correlated .51 with WISC-R IQ and produced
similar overall scores, with a mean IQ of 100 versus 
mean DAP score of 95 (Wisniewski & Naglieri, 
1989). Lassiter and Bardos (1995) found that the 
DAP score underestimated IQ scores obtained from 
the WPPSI-R and the K-BIT in a sample of 50 
kindergartners and first graders. 

Reviewers praise the DAP for its clear scoring 
system, strong reliability, and careful standardiza- 
tion (Cosden, 1992). However, results of validity 
studies are more cautionary. Harrison and Schock 
(1994) note that the accumulated evidence with 
HFD tests indicates low to moderate predictive
validity. In spite of their popularity and appeal, HFD
tests do not effectively identify children with learn- 
ing difficulties or developmental disabilities, and 
they may not be valid for use even as screening 
measures. 


Hiskey-Nebraska Test of Learning Aptitude 


The Hiskey-Nebraska Test of Learning Aptitude 
(H-NTLA) is a nonlanguage performance scale for 
use with children ages 3 to 17 years (Hiskey, 1966). 
This test can be administered entirely through pan- 
tomime and requires no verbal response from the 
examinee. However, verbal instructions can be used
with children who have normal hearing or mild hearing
impairment. The H-NTLA consists of 12 subtests:


Bead Patterns                Block Patterns
Memory for Color             Completion of Drawings
Picture Identification       Memory for Digits
Picture Association          Puzzle Blocks
Paper Folding                Picture Analogies
Visual Attention Span        Spatial Reasoning


Raw scores on the subtests are converted into a 
Deviation Learning Quotient (LQ) with mean of 
100 and standard deviation of 16. H-NTLA scores 
correlate quite robustly with achievement scales for 
grades 2 through 12 (median r = .49) and also with 
WISC-R Performance IQ (r = .85). Although the 
LQ yields average scores that are remarkably close 
to WISC-R Performance IQ for samples of children 
with hearing impairment and those who are deaf, 
the H-NTLA scores are substantially more variable 
(Watson & Goldgar, 1985; Phelps & Ensor, 1986). 
Thus, use of the H-NTLA may increase the risk of 
false positive misclassification—labeling children 
as gifted when they are only bright, or as having 
mental retardation when they are merely border- 
line. 
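
The link between greater variability and misclassification can be sketched numerically. The example below assumes normally distributed scores with a mean of 100 and compares the percentage of scores beyond fixed cutoffs under two standard deviations. The value of 15 matches the WISC-R metric, while the value of 20 is purely hypothetical and is not a figure reported in the studies cited above. The function name and the cutoffs of 130 and 70 are also chosen only for illustration.

```python
from statistics import NormalDist


def percent_beyond(cutoff, sd, mean=100.0, above=True):
    """Percent of a normal distribution above (or below) a cutoff."""
    dist = NormalDist(mu=mean, sigma=sd)
    share = 1.0 - dist.cdf(cutoff) if above else dist.cdf(cutoff)
    return 100.0 * share


# SD of 15: the WISC-R metric. SD of 20: hypothetical, for illustration.
pct_above_130_sd15 = percent_beyond(130, 15)
pct_above_130_sd20 = percent_beyond(130, 20)
pct_below_70_sd20 = percent_beyond(70, 20, above=False)
print(round(pct_above_130_sd15, 1))  # prints 2.3
print(round(pct_above_130_sd20, 1))  # prints 6.7
```

Under these assumptions the more variable scale places nearly three times as many examinees beyond each cutoff, even though the two scales share the same mean.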

The H-NTLA is useful with children who are 
deaf, have speech or language impairments or men- 
tal retardation, or those who are bilingual. An in- 
teresting feature of this test is the development of 
parallel norms: The H-NTLA was standardized on 
1,079 children who were deaf and 1,074
normal-hearing children ages 2½ to 17½. However, the
chief weakness of the instrument is the inadequacy 
of these norms. For example, the representativeness 
of the sample of those who were deaf—picked on 
an opportunistic basis from schools for those who 
are deaf—is largely unknown. Standardization of 
the normal-hearing sample was based on occupa- 
tional level of parents according to the 1960 U.S. 
Census. A contemporary and more detailed re- 
standardization of the test would be quite helpful. 


Test of Nonverbal Intelligence-3 


The Test of Nonverbal Intelligence-3 (TONI-3) is a 
language-free measure of cognitive ability de- 
signed for disabled or minority populations 
(Brown, Sherbenou, & Johnsen, 1998). In particu- 
lar, the authors recommend the test for assessing 
persons with aphasia, non-English speakers, those 
with hearing impairments, and persons who have 
experienced a variety of severe neurological trau- 
mas. The test instructions are pantomimed by the 
examiner and the examinee answers by pointing to 
one of six possible responses. The test consists of 


two equivalent forms of 50 abstract/figural prob- 
lem-solving items. These items were carefully se- 
lected from an initial pool of items according to 
item-total correlations, appropriate difficulty level, 
and acceptability to potential users and technical 
experts. The TONI-3 items fall into several cate- 
gories, including the following: 


Simple matching 
Analogies 
Classification 
Intersection 
Progressions 


Except for the simple-matching items, the TONI-3 
items require the examinee to solve problems by 
identifying relationships among abstract figures. 
Many of the items are similar in format to those 
found on Raven’s Progressive Matrices. The test 
yields two kinds of scores: percentile ranks and 
TONI-3 quotients (mean of 100 and standard devi- 
ation of 15). 

The TONI-3 was carefully standardized on over 
3,000 subjects ranging in age from 6 through 89. 
Sample characteristics paralleled census data for 
sex, race, ethnicity, urban-suburban-rural resi- 
dence, grade, parental education/occupation, and 
geographic region. Reliability data are quite satis- 
factory, with internal consistency coefficients typ- 
ically exceeding .90 and alternate-forms reliability 
in the range of .80 to .95. 

Validity studies of the TONI-3 are scant, but
investigations of prior editions (which are highly
similar in content) are supportive of this test as a
culture-reduced index of general intelligence. 
Nonetheless, research does not support the view 
that the TONI-3 is a nonverbal test, except in the 
trivial sense that verbal responses are not required. 
For example, the TONI-2 manual reports correla- 
tion coefficients in the .70s between TONI-2 scores 
and the Language Arts subtest of the SRA Achieve- 
ment Series. In general, research studies with pre- 
cursors to the TONI-3 indicate that it is a good 
measure of general intelligence, but they do not 
support the view that it is mainly a measure of non- 
verbal intelligence (Murphy, 1992). Overall, the 
TONI-3 is highly regarded as a brief nonlanguage
screening device for subjects with impaired lan- 
guage abilities (e.g., for those who are aphasic, 
deaf, or non-English-speaking or who have mental 
retardation). The test is more carefully standardized 
than most and possesses excellent reliability. A use- 
ful feature of the TONI-3 is that the untimed ad- 
ministration seldom exceeds 20 minutes. 

Two instruments discussed earlier in the text 
also qualify as nonlanguage tests. Raven’s Pro- 
gressive Matrices and the Cattell Culture Fair In- 
telligence Test utilize nonverbal items and require 
essentially no language-based interactions between 
examiner and examinee. A new and promising 
language-free test is the Universal Nonverbal In- 
telligence Test (UNIT), a comprehensive and mul- 
tidimensional measure of nonverbal intelligence 
(McCallum & Bracken, 1997; Reed & McCallum, 
1995). This test is designed for children with hear- 
ing impairment or limited English proficiency. So- 
phisticated item analyses indicate that the UNIT is 
an unbiased measure of nonverbal intelligence in 
children who are profoundly deaf (Maller, 2000). 
The UNIT provides a good measure of g and sev- 
eral subscores, including clear, factor-based scores 
on memory and reasoning. 

NONREADING AND
MOTOR-REDUCED TESTS

As the reader will recall, nonreading tests are de- 
signed for illiterate examinees who can, nonethe- 
less, understand spoken English well enough to 
follow oral instructions. Nonreading tests of intel- 
ligence are well suited to young children, illiterate 
examinees, and persons with speech or expressive- 
language impairments. These tests need not be 
specialized or esoteric: The performance subtests 
of most mainstream instruments qualify as non- 
reading tests. For example, examiners may use the 
WISC-III performance subtests to estimate the in- 
telligence of examinees with language disabilities. 

However, clients with cerebral palsy or other 
orthopedically impairing conditions will score very 
poorly on nonreading tests that require manipulatory 
responses. Obtaining valid test results from such
persons can present an enormous challenge (Case
Exhibit 7.1). The motor deficits, increased tendency
to fatigue, and inexactness of purposive
movements common to persons with cerebral palsy
will negatively affect their performance on cognitive
assessment tools. Orthopedically impaired
clients need tests that are both nonreading and
motor-reduced. In particular, tests that permit a
simple pointing response are well suited to the
assessment of children and adults with cerebral palsy
or other motor-impairing conditions.


Peabody Picture Vocabulary Test-III


The Peabody Picture Vocabulary Test-III (PPVT- 
III) is the best known and most widely used of the 
nonreading, motor-reduced tests (Dunn & Dunn, 
1998). The PPVT-III is used to obtain a rapid mea- 
sure of listening vocabulary with persons who are 
deaf or who have neurological or speech impair- 
ments. Although the PPVT-III is useful with any 
examinee who cannot verbalize well, the test is es- 
pecially useful with examinees who also manifest 
motor-impairing conditions such as cerebral palsy 
or stroke. 

The PPVT-III comes in two parallel versions, 
each consisting of 4 practice plates and 204 testing 
plates. Each plate contains four line drawings of ob- 
jects or everyday scenes. The examiner presents a 
plate, states the stimulus word orally, and asks the 
examinee to point to the one picture that best depicts 
the stated word. The test items are precisely ordered 
according to difficulty level, arranged in 17 sets of 
12 items each for efficient identification of basal and 
ceiling levels. The entry level is determined by age, 
and examinees continue until they reach their ceil- 
ing level. Although the test is untimed, administra- 
tion seldom exceeds 15 minutes. Raw scores are 
converted to age equivalents or standard scores 
(mean of 100, standard deviation of 15). 

The PPVT-III was standardized on a represen- 
tative national sample of 2,725 individuals ranging 
from 2½ to 90 or more years of age. Reliability data
for the new edition are exceptionally strong, with 
typical internal consistency coefficients of .94, 
alternate-forms reliabilities of .94, and test-retest 
correlations of .92. Concurrent validity studies are 
also highly supportive, demonstrating robust cor- 
relations with verbal intelligence measures. For ex- 
ample, the test developers report correlations of .91 
with WISC-III Verbal IQ and .82 with K-BIT Vo- 
cabulary scores (Dunn & Dunn, 1998). 

The test developers of the PPVT-III took great 
care to minimize and balance cultural influences in 
the test items. Independent consultants representing
the perspectives of African Americans, Asians,
Hispanics, Native Americans, and women reviewed
the content and artwork of the PPVT-III during
development, and adjustments were made following
these reviews. The test items demonstrate attractive 
artwork that is balanced for racial and gender dif- 
ferences, including persons with physical disabili- 
ties. However, the evidence is mixed as to whether 
the PPVT-III is a culturally fair instrument that 
serves as a valid measure with minority children. 
For example, Washington and Craig (1999) found 
that 59 African American preschoolers at risk for 
academic failure averaged 91 on the test (SD of 11), 
which was seen as commensurate with their envi- 
ronmental disadvantages. These authors laud the 
test as “culturally fair.” However, Campbell, Bell, 
and Keith (2001) reported an average score of 82 
(SD of 12) for 416 African American children of 
low socioeconomic status, which was 8 points 
lower than their overall score on the K-ABC. These 
researchers concluded: “Despite the attempts to re- 
duce racial differences, the PPVT-III appears to 
perform similarly to prior editions of the Peabody 
scales. On average, the PPVT-III tends to underes- 
timate both intellectual ability and scholastic 
achievement, as measured by the K-ABC, in low 
SES, African American children” (p. 91). Further 
research will be needed to clarify the utility of this 
test with minority children. 

Several lines of evidence support the validity 
of the Peabody test, but only as a narrow measure 
of vocabulary, not as a general measure of intelli- 
gence (Altepeter, 1989; Altepeter & Johnson, 1989). 
Dunn and Dunn (1981) sought to ensure content 
validity by searching Webster’s New Collegiate 
Dictionary for all words whose meanings could be 
represented by a picture. Thus, the authors had a 
specific content universe in mind, and the items 
from the Peabody appear to be a fair sampling from 
this domain. In addition, the authors used sophisti- 
cated item-selection techniques based on the Rasch- 
Wright latent-trait model to help build construct 
validity into the test. This model enables researchers 
to construct a growth curve for the latent trait being 
measured (hearing vocabulary) and to select items 
that best fit the curve. Using tryout and calibration 
data, the curve was drawn repeatedly on a computer.
If an item did not fit the Rasch-Wright latent-trait 
model (too flat or too steep an item-characteristic 
curve) it was discarded from consideration. 
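
The general shape of a Rasch item-characteristic curve can be sketched as follows. This is the textbook one-parameter logistic function, offered as an illustration only. The details of the calibration actually carried out for the Peabody are not given here, and the function name and the ability and difficulty values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """P(correct) under the one-parameter logistic (Rasch) model.

    Both arguments are on the same logit scale; the probability is .50
    when the examinee's ability equals the item's difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Expected pass rates for one hypothetical item (difficulty = 0.0)
# across a range of hypothetical ability levels.
curve = [rasch_probability(theta, 0.0) for theta in (-2, -1, 0, 1, 2)]
print([round(p, 2) for p in curve])  # prints [0.12, 0.27, 0.5, 0.73, 0.88]
```

An item whose observed pass rates rise much more slowly or much more sharply across ability levels than this expected curve would be a candidate for removal.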

Using a sophisticated structural equation model, 
Miller and Lee (1993) demonstrated that an earlier 
edition, the PPVT-R, can be assumed to reflect true 
developmental level of vocabulary. These re- 
searchers were able to predict rank order of the 
PPVT-R stimulus words reasonably well based upon 
complex word characteristics (date of entry into the 
English language, word length, number of separate 
meanings, frequency of occurrence). The predictor 
variables provided a reasonable theoretical account 
of the word ordering in the PPVT-R; that is, they 
confirmed the construct validity of the test. 

Concurrent and predictive validity data for the 
Peabody are somewhat limited, but promising. Sev- 
eral investigators have correlated the PPVT-R with 
achievement measures, where modest relation- 
ships (r’s from .30 to .60) are common (Naglieri, 
1981; Naglieri & Pfeiffer, 1983). Correlations with 
reading achievement tend to be higher than with 
spelling and arithmetic achievement, suggesting 
that the PPVT-R has appropriate discriminant va- 
lidity (Vance, Kitson, & Singer, 1985). 

Several investigators have correlated earlier 
versions of the Peabody with intelligence mea- 
sures, particularly the WISC-R and WAIS-R, and 
healthy correlations (near .70) are the rule (e.g., 
Haddad, 1986; Naglieri & Yazzie, 1983). As might 
be expected, correlations tend to be higher with 
Verbal IQ than Performance IQ. 

In a very important and ingenious study, Max- 
well and Wise (1984) investigated the vocabulary 
loading of the Peabody in a sample of 84 inpatients 
from psychiatry and psychology wards. Their study 
utilized the PPVT, but this earlier edition is similar 
to the PPVT-III, so that the conclusions are perti- 
nent here. The researchers investigated the hypoth- 
esis that the PPVT assesses more than vocabulary 
in adults. In addition to the PPVT, the researchers 
collected data on the following: WAIS-R, Wechsler 
Memory Scale, name-writing speed, and years of 
education. Name-writing speed is simply the number
of seconds required for the examinee to write
his or her full name. Even though all variables had
significant correlations with PPVT IQ, WAIS-R 
Vocabulary had by far the strongest correlation (r = 
.88). More important, when the variance accounted 
for by Vocabulary was removed, none of the re- 
maining variables had any predictive relationship 
with the PPVT. In short, the Peabody is a good 
measure of vocabulary (hearing vocabulary, in par- 
ticular) but could be misleading if used as a global 
measure of intellect. 
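
The logic of removing the variance accounted for by Vocabulary can be illustrated with a first-order partial correlation. The sketch below is not a reconstruction of the analysis by Maxwell and Wise (1984). Only the correlation of .88 comes from the text; the two correlations involving years of education are hypothetical values chosen for illustration, as is the function name.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after z is partialled out of both."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


r_ppvt_vocab = 0.88   # reported in the text
r_ppvt_educ = 0.50    # hypothetical
r_vocab_educ = 0.57   # hypothetical

# x = PPVT, y = years of education, z = WAIS-R Vocabulary
residual = partial_correlation(r_ppvt_educ, r_ppvt_vocab, r_vocab_educ)
print(round(residual, 2))
```

With these values the partial correlation is essentially zero: a variable can correlate substantially with the PPVT and still add nothing once Vocabulary is taken into account.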

The PPVT-III is a recent revision, so indepen- 
dent research with the test is limited. One caution 
with the previous edition, the PPVT-R, is that stan- 
dard scores may be substantially lower than Wech- 
sler IQs, particularly with persons with mental 
retardation and minority examinees. In a sample of 
21 adults with mild mental retardation, Prout and 
Schwartz (1984) found the PPVT-R standard scores 
(mean of 56) to be an average of 9 points lower than 
the WAIS-R IQ (mean of 65). Naglieri and Yazzie 
(1983) found a huge 26-point difference with a 
sample of Navajo Indian children, who averaged a 
standard score of 61 on the PPVT-R in contrast to 
WISC-R IQ of 87. On a similar note, with the 
PPVT-III, Bell, Lassiter, Matthews, and Hutchin- 
son (2001) found that the instrument tended to un- 
derestimate WAIS-III IQ scores of bright college 
students by about 10 points. 

Overall, we may conclude that the Peabody is 
a well-normed measure of hearing vocabulary that 
is useful with nonreading and motor-impaired ex- 
aminees. However, the instrument is not a substi- 
tute for a general intelligence test and PPVT-III 
scores may underestimate intellectual function- 
ing in some groups (e.g., minority children, high- 
functioning adults). 


TESTING PERSONS WITH 
VISUAL IMPAIRMENTS 


Many millions of American adults have some de- 
gree of visual impairment, including more than 
1 million individuals who are legally blind—a 
term used in determining eligibility for government 
benefits. This term applies to individuals with central
visual acuity of 20/200 or less in the better eye
(with correction) or to those with significant reduction
in their visual field to a diameter of 20 degrees
or less (Bradley-Johnson & Ekstrom, 1998).
The number of children with visual impairment is 
substantially smaller, with only 0.4 percent of stu- 
dents between the ages of 6 and 21 years receiving 
special education services because of a vision prob- 
lem (U.S. Department of Education, 1992). In ad- 
dition to special arrangements in testing (see Topic 
2B, The Testing Process), individuals with visual 
impairment may require unique instruments for 
valid assessment. 

In assessing the intellectual functioning of the 
visually impaired, examiners have historically re- 
lied upon adaptations of the Stanford-Binet. The 
Hayes-Binet revision for testing those with visual 
impairment was based on the 1916 Stanford-Binet; 
this instrument has since undergone several revi- 
sions. The most recent adaptation is the Perkins- 
Binet (Davis, 1980). The Perkins-Binet retains 
most of the verbal items from the Stanford-Binet, 
but also adapts other items to a tactual mode. The 
Perkins-Binet possesses acceptable split-half relia- 
bility and shows high correlations with verbal 
scales of the WISC-R (Coveny, 1972; Teare & 
Thompson, 1982). The developers of the Perkins- 
Binet have acknowledged that visual problems 
exist on a continuum by developing separate norms 
for children with usable vision (Form U) and no us- 
able vision (Form N). 

Test developers have also succeeded in modi- 
fying the Wechsler Performance scales for use with 
individuals with visual impairments. The Haptic 
Intelligence Scale for the Adult Blind (HISAB) 
consists of six subtests, four of which resemble the 
Digit Symbol, Block Design, Object Assembly, and 
Picture Completion tests of the WAIS Performance 
scale (Shurrager, 1961; Shurrager & Shurrager, 
1964). The remaining two subtests consist of Bead 
Arithmetic, which involves the use of an abacus to 
solve arithmetic problems, and a Pattern Board, 
which requires the examinee to reproduce the pat- 
tern felt on a board that has rows of holes with pegs 
in them. The reliability of the HISAB is excellent 
and the authors provide normative data on a sample
of adults with visual impairment. Most encouraging
of all, HISAB scores correlate .65 with the
WAIS Verbal IQ (Shurrager & Shurrager, 1964). 
Although the HISAB is still manufactured and sold 
by Stoelting Company, unfortunately, the test has 
never been investigated empirically. A search of 
PsycINFO for research with this instrument did
not locate a single article. 

Another interesting instrument is the Blind 
Learning Aptitude Test (BLAT), a tactile test for 
children from 6 to 16 years of age who are blind 
(Newland, 1971). The BLAT items are in bas-relief 
form, consisting of dots and lines similar to Braille. 
The items consist of six different types: recognition 
of differences, recognition of similarities, identifi- 
cation of progressions, identification of the miss- 
ing element in a 2 x 2 matrix, completion of a 
figure, and identification of the missing element in 
a 3 x 3 matrix. Most of the items were adapted 
from Raven’s Progressive Matrices and the Cattell 
Culture Fair Intelligence Test. The BLAT is stan- 
dardized on 760 children with visual impairment, 
but the norms are outdated and the test manual is 
incomplete and somewhat slipshod (Herman, 
1988). Nonetheless, the test possesses exceptional 
reliability and correlates very well with the
Hayes-Binet (r = .74) and the WISC Verbal scale (r = .71).
The BLAT also shows strong correlations with 
Braille oral reading speed and comprehension 
(Baker, Koenig, & Sowell, 1995). In conjunction 
with a verbal test, the BLAT is a promising instru- 
ment for testing the intelligence of children with vi- 
sual disabilities. However, the test would profit 
substantially from minor revisions, updated norms, 
and a more thorough test manual. 


TESTING INDIVIDUALS WHO ARE 
DEAF OR HARD OF HEARING 


Upward of 1 million Americans are deaf or suffi- 
ciently hard of hearing that they rely upon Ameri- 
can Sign Language (ASL) as their primary means 
of communication (Brauer, Braden, Pollard, & 
Hardy-Braz, 1998). Given the typical limited mas- 
tery of the English language of persons who are 
deaf, and, vice versa, the typical psychologist’s lim- 
ited (or nonexistent) skill in ASL, the proper and 
valid assessment of individuals who are deaf poses
a profound cross-cultural challenge. 

More is involved than just picking a test devel- 
oped for, and normed upon, individuals who are 
deaf or hard of hearing and who use sign language. 
One problem is that sign language “can now be 
characterized on a multidimensional continuum en- 
compassing numerous styles, lexical variants, syn- 
tactic structures, dialects, and approximations to or 
departures from English word ordering” (Brauer et 
al., 1998, p. 299). Thus, a test developed in stan- 
dard ASL is not equally fair to all persons who are 
deaf. In general, the proper and valid assessment of
persons who are deaf requires that interested psy- 
chologists immerse themselves in the Deaf culture 
and also seek relevant educational and training 
experiences: 


One especially needs a thorough understanding of 
the implications of deafness and the use of sign 
language for making diagnoses for people who are 
deaf. Few hearing psychologists have these skills. 
The push is for specialized training programs in 
deafness and psychology, a need that has been rec- 
ognized for decades. (Brauer et al., 1998, p. 303) 


If a consulting psychologist does not possess these 
skills, then the assessment of persons who are deaf 
should be referred to a person or agency with the 
requisite talents and expertise. 

The use of a sign language interpreter in the test- 
ing of persons who are deaf is a complicated and 
controversial matter. One concern is that the inter- 
preter may inadvertently alter the content of the test, 
therefore affecting the validity of the findings. Cer- 
tainly, it is unwise for parents or teachers to serve as 
interpreters. However, it is also true that persons 
who are deaf and who use sign language achieve 
higher IQs when the directions are signed than when 
they are delivered in the traditional manner (Braden, 
1992). The preferred resolution is for the examiner 
to be fluent in sign language, so that any necessary 
translations stay within the bounds of standardized 
procedure. 

For the intellectual assessment of persons who 
are deaf or hard of hearing, the Wechsler Performance
subtests remain the tools of choice (Braden
& Hannah, 1998). The impact of English language
facility is minimized on these subtests, so it is 
thought that they provide a more accurate measure 
of cognitive skill than the Verbal subtests. Other
tests sometimes used with persons who are deaf include
Raven’s Progressive Matrices (Raven, Court,
& Raven, 1992) and the Hiskey-Nebraska Test of 
Learning Aptitude, discussed previously. The 
WAIS-III is now available in a formal ASL translation
(demonstrated on videotape), endorsed and
disseminated by the test publisher (Kostrubala & 
Braden, 1998). 


ASSESSMENT OF ADAPTIVE 
BEHAVIOR IN MENTAL RETARDATION 


The assessment of mental retardation is a complex 
and multifaceted concern that rightfully deserves a 
chapter or book on its own. Owing to space 
limitations, our coverage is necessarily abridged; 
interested readers are referred to American As- 
sociation on Mental Retardation (2002), Nihira 
(1985), and Sattler (1988, chaps. 15 and 21). Here, 
we briefly summarize the diagnostic criteria for 
mental retardation, then review two contrasting as- 
sessment instruments in modest detail. We close 
with a tabular summary of several prominent mea- 
sures of adaptive behavior. 


Definition of Mental Retardation 


The most authoritative source for the definition of 
mental retardation is the manual of terminology 
and classification of the American Association on 
Mental Retardation (AAMR, 2002). This manual 
defines mental retardation as follows: 


Mental retardation refers to substantial limitations 
in present functioning. It is characterized by signif- 
icantly subaverage intellectual functioning, existing 
concurrently with related limitations in two or 
more of the following applicable adaptive skill 
areas: communication, self-care, home living, so- 
cial skills, community use, self-direction, health 
and safety, functional academics, leisure, and work. 
Mental retardation manifests before age 18. 
(AAMR, 2002) 


The manual further specifies that significantly 
subaverage intellectual functioning is an IQ of 70 
to 75 or below on scales with a mean of 100 and a 
standard deviation of 15. On tests such as the Stan- 
ford-Binet: Fourth Edition that possess a standard 
deviation of 16, the approximate range for retarded 
intellectual functioning would be an IQ of 68 to 73 
or below. The manual also explicitly affirms the im- 
portance of professional judgment in individual 
cases. 
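
The arithmetic behind these two ranges can be checked by holding the z score constant across scales. The following is a minimal sketch in Python, assuming only the mean of 100 and the standard deviations of 15 and 16 mentioned above; the function name rescale_iq is our own and does not come from the AAMR manual.

```python
def rescale_iq(score, sd_from=15.0, sd_to=16.0, mean=100.0):
    """Re-express a deviation IQ on a scale with a different standard deviation."""
    z = (score - mean) / sd_from   # distance from the mean in SD units
    return mean + z * sd_to        # the same z score on the new scale


# Cutoffs of 70 and 75 on an SD-15 scale, re-expressed on an SD-16 scale
lower = rescale_iq(70)   # z = -2.00
upper = rescale_iq(75)   # z = -1.67 (approximately)
print(round(lower), round(upper))  # prints: 68 73
```

Rounded to whole numbers, the results agree with the approximate range of 68 to 73 given for tests with a standard deviation of 16.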

A low IQ by itself is an insufficient foundation 
for the diagnosis of mental retardation. The AAMR 
definition also specifies a second criterion, that of 
limitations in two or more of the relevant adaptive 
skill areas. A diagnosis of mental retardation is 
warranted only when an individual displays a suf- 
ficiently low IQ and limitations in adaptive skill. 
Further, these deficits in intellect and adaptive 
functioning must have arisen during the develop- 
mental period—defined as between birth and the 
eighteenth birthday. 
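
The logical structure of the definition (all three criteria must hold at once) can be expressed compactly. The sketch below is only an illustration of that conjunction, not a diagnostic procedure: the function name, its parameters, and the single iq_cutoff value are our own assumptions, and the manual itself stresses that professional judgment governs individual cases.

```python
# The 10 adaptive skill areas listed in the AAMR (2002) definition
ADAPTIVE_SKILL_AREAS = frozenset({
    "communication", "self-care", "home living", "social skills",
    "community use", "self-direction", "health and safety",
    "functional academics", "leisure", "work",
})


def meets_aamr_criteria(iq, limited_areas, age_at_onset, iq_cutoff=75):
    """Return True only if all three definitional criteria are satisfied.

    iq_cutoff is a hypothetical parameter: the manual describes a zone of
    70 to 75 rather than one fixed value, so the caller must choose.
    """
    areas = set(limited_areas)
    unknown = areas - ADAPTIVE_SKILL_AREAS
    if unknown:
        raise ValueError(f"unrecognized adaptive skill areas: {sorted(unknown)}")
    low_iq = iq <= iq_cutoff                  # criterion 1: subaverage IQ
    adaptive_limits = len(areas) >= 2         # criterion 2: two or more areas
    developmental_onset = age_at_onset < 18   # criterion 3: onset before 18
    return low_iq and adaptive_limits and developmental_onset
```

Under these assumptions, a low IQ with limitations in only one adaptive skill area returns False, which mirrors the point that a low IQ alone never suffices.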

This most recent AAMR manual represents a 
departure from previous terminology, which rec- 
ognized four levels of retardation: mild, moderate, 
severe, and profound. Instead of focusing upon the 
shortcomings of the person, the manual introduces 
a hierarchy of “Intensities of Needed Supports,” 
which redirects attention to the rehabilitation needs 
of the client. The four levels of needed supports are 
intermittent, limited, extensive, and pervasive. 
However, the previous terminology referring to lev- 
els of retardation will likely prevail for quite some 
time, so we have chosen to blend the old and the 
new approach in Table 7.1. The reader will notice 
a zone of uncertainty between levels of retardation, 
which signifies that clinical judgment about all 
sources of information is required in diagnosis. 
Furthermore, even though these levels are cali- 
brated by IQ ranges, we remind the reader that the 
examinee must also show corresponding deficit in 
two or more areas of adaptive skill. Under no cir- 
cumstances is an IQ test a sufficient basis for diag- 
nosing mental retardation. 

Limitations in adaptive skill are more difficult 
to confirm than a low IQ. The AAMR manual lists 
10 different areas of adaptive skill and specifies that 




TABLE 7.1 Four Levels of Mental Retardation 


Mild Mental Retardation: IQ of 50-55 to 70-75+, In- 
termittent Support required. Reasonable social and 
communication skills; with special education, attain 6th 
grade level by late teens; achieve social and vocational 
adequacy with special training and supervision; partial 
independence in living arrangements. 

Moderate Mental Retardation: IQ of 35-40 to 50-55, 
Limited Support required. Fair social and communica- 
tion skills but little self-awareness; with extended spe- 
cial education, attain 4th grade level; function in a 
sheltered workshop but need supervision in living 
arrangements. 


Severe Mental Retardation: IQ of 20-25 to 35-40, Ex- 
tensive Support required. Little or no communication 
skills; sensory and motor impairments; do not profit 
from academic training; trainable in basic health habits.

Profound Mental Retardation: IQ below 20-25,
Pervasive Support required. Minimal functioning; 
incapable of self-maintenance; need constant nursing 
care and supervision. 








Source: Based on AAMR (2002) and Patton, Payne, and Beirne- 
Smith (1986). 


the client must show substantial limitations in two 
or more of them: 


• Communication
• Self-care
• Home living
• Social skills
• Community use
• Self-direction
• Health and safety
• Functional academics
• Leisure
• Work.


As to how these limitations are to be assessed, the 
manual proposes that well-normed measures of 
adaptive skills are desirable, but the final determi- 
nation is always a matter of clinical judgment. 

A test developer faces major problems in cali- 
brating limitations in adaptive skill. About the only 
hard fact we have in this domain is that environ- 
mental expectations for adaptive behavior increase 
sharply from birth through young adulthood. In addition,
the expression of adaptive behavior changes
character throughout. In childhood, adaptive behaviors
may be reflected in sensory-motor skills
and facility with language. In adulthood, vocational 
attainment and social responsibility become im- 
portant. Just as with intellectual assessment, tools 
for appraising adaptive behavior must be carefully 
age-graded. 

The first standardized instrument for assessing 
adaptive behavior was the Vineland Social Maturity 
Scale (Doll, 1935, 1936). Somewhat simplistic and 
coarse-grained by modern standards, the original 
Vineland scale consisted of 117 discrete items 
arranged in a year-scale format. An informant famil- 
iar with the examinee would check off applicable 
items. From these results the examiner would calcu- 
late an equivalent social age, helpful in the diagnosis 
of mental retardation. Still a respected instrument, 
the Vineland has undergone several revisions and is 
now known as the Vineland Adaptive Behavior 
Scales (Sparrow, Balla, & Cicchetti, 1984). 

Since the release of the original Vineland scale, 
over 100 scales of adaptive behavior have been pub- 
lished (Nihira, 1985; Reschly, 1990; Walls, Werner, 
Bacon, & Zane, 1977). These instruments vary 
greatly in structure, intended purpose, and targeted
population. Broadly speaking, we can distinguish 
two types of instruments designed for two different 
purposes. One group of mainly norm-referenced 
scales is used largely to assist in diagnosis and clas- 
sification. Another group of mainly criterion-refer- 
enced scales is used largely to assist in training and 
rehabilitation. We have chosen one representative in- 
strument from each group for more detailed analysis. 


Scales of Independent Behavior-Revised 


The Scales of Independent Behavior-Revised (SIB- 
R; Bruininks, Woodcock, Weatherman, & Hill, 
1996) is an ambitious, multidimensional measure 
of adaptive behavior that is highly useful in the as- 
sessment of mental retardation. The instrument 
consists of 259 adaptive behavior items organized 
into 14 subscales. The scale is completed with the 
help of a parent, caregiver, or teacher well acquainted
with the examinee’s daily behaviors. For
each subscale, the examiner reads a series of items 
and for each item records a score from 0 (never or 
rarely does task) to 3 (does task very well). A use- 
ful feature of the SIB-R is that examiners need a 
minimum of training and experience. Of course, a 
much higher level of competence is required to 
evaluate results and make decisions about place- 
ment or treatment. 

The 14 subscales of the SIB-R are arranged into 4
clusters, as outlined in Table 7.2. In turn, these 4 
clusters constitute the Broad Independence Scale. 
Each subscale consists of a small number of dis- 
crete, developmentally ordered items. For example, 
the subscale on Eating and Meal Preparation has 19 
graded items, including spearing food with a fork, 
eating soup with a spoon, taking appropriate-sized 
portions, and preparing snacks that do not require 
cooking. For each subscale, items are administered 
until a predetermined ceiling is reached (e.g., 3 of 
5 consecutive items scored 0). 
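
A ceiling rule of this kind is easy to state procedurally. The sketch below assumes the example rule quoted above (stop once 3 of the most recent 5 items are scored 0) and further assumes that testing may stop before 5 items have been given if 3 zeros have already accumulated; the actual rules are set out in the SIB-R manual, and the function name and sample ratings here are hypothetical.

```python
def administer_until_ceiling(item_scores, zeros_needed=3, window=5):
    """Return the ratings actually administered before the ceiling is reached.

    item_scores: ratings from 0 to 3, in developmental (easy-to-hard) order.
    """
    administered = []
    for score in item_scores:
        administered.append(score)
        recent = administered[-window:]        # the last `window` items given
        if recent.count(0) >= zeros_needed:    # ceiling reached: stop testing
            break
    return administered


# Hypothetical ratings for one subscale
ratings = [3, 3, 2, 1, 0, 1, 0, 0, 2, 3]
given = administer_until_ceiling(ratings)
raw_score = sum(given)   # raw subscale score from the administered items
```

With these made-up ratings, testing stops after the eighth item, so the two later items are never administered and do not contribute to the raw score.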

Raw scores for a subtest are added to obtain a 
part score. The part scores for each cluster are then 
added to obtain the cluster score. The score for the 
Broad Independence Scale is derived from the four 
cluster scores. The subtest scores, cluster scores, 
and the Broad Independence score can then be con- 
verted to a variety of normative scores to permit 
comparison of the examinee’s performance with 
the performance of the national norming sample. 
The normative scales include age scores, percentile 
ranks, standard scores, stanines, and normal curve 
equivalents. 
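
The normative scales named above are all re-expressions of relative standing. As a rough sketch, and assuming normally distributed scores (the SIB-R itself relies on empirical norm tables, which this code does not reproduce), each can be derived from a z score. The standard score metric of mean 100 and SD 15 is an assumption here, and the function name is our own.

```python
import math
from statistics import NormalDist


def normative_scores(z):
    """Convert a z score to several common normative metrics (normal model)."""
    percentile = NormalDist().cdf(z) * 100     # percent of norm group at or below z
    standard = 100 + 15 * z                    # assumed mean 100, SD 15
    nce = 50 + 21.06 * z                       # normal curve equivalent
    stanine = int(math.floor(2 * z + 5.5))     # half-SD bands centered on 5
    stanine = min(9, max(1, stanine))          # stanines are limited to 1..9
    return {
        "percentile": percentile,
        "standard": standard,
        "nce": nce,
        "stanine": stanine,
    }


scores = normative_scores(-2.0)   # two standard deviations below the mean
```

Under this model, a z score of -2.0 corresponds to a standard score of 70, a stanine of 1, and a percentile rank a little above 2.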

A separate, unique part of the SIB-R also as- 
sesses maladaptive behavior by measuring the fre- 
quency and severity of problem behaviors. The 
Problem Behaviors Scale includes eight major cat- 
egories of personal and social maladjustment that 
could affect adaptive behavior: Hurtful to Self, 
Hurtful to Others, Destructive to Property, Dis- 
ruptive Behavior, Unusual or Repetitive Habits, 
Socially Offensive Behavior, Withdrawal or Inat- 
tentive Behavior, and Uncooperative Behavior. Ex- 
amples of problem behaviors are listed, and the 
respondent must indicate the behaviors displayed 
by the examinee. In addition, the respondent 

TABLE 7.2 The Subscales and Clusters of the Scales of Independent Behavior-Revised

1. Motor Skills
   Gross Motor: 19 large muscle skills such as sitting without support or taking part in strenuous physical activities.
   Fine Motor: 19 small muscle skills such as picking up small objects or assembling small objects.

2. Social and Communication Skills
   Social Interaction: 18 skills requiring interaction with other people such as handing toys to others or making plans with friends to attend social activities.
   Language Comprehension: 18 skills involving the understanding of spoken and written language such as looking toward a speaker or reading.
   Language Expression: 20 tasks involving talking such as making sounds to get attention or explaining a written contract.

3. Personal Living Skills
   Eating and Meal Preparation: 19 skills related to eating and meal preparation, ranging from drinking from a glass to planning a meal.
   Toileting: 17 skills necessary to bathroom and toilet use.
   Dressing: 18 skills related to dressing, ranging from holding out arms and legs while being dressed to arranging for clothing alterations.
   Personal Self-Care: 16 tasks involved in basic grooming and health maintenance, for example, washing hands and making a medical appointment.
   Domestic Skills: 18 tasks needed to maintain a home, ranging from putting empty dishes in the sink to selecting appropriate housing.

4. Community Living Skills
   Time and Punctuality: 19 tasks involving time concepts and time management such as keeping appointments.
   Money and Value: 20 skills related to money concepts, such as saving money and using credit.
   Work Skills: 20 skills related to prevocational and work habits, for example, indicating that an assigned task is completed.
   Home-Community Orientation: 18 skills involved in getting around the home and neighborhood and traveling in the community, for example, locating a dentist.





describes the one most serious behavior in each 
category and rates it according to frequency of oc- 
currence, severity, and typical management. 

The standardization of the SIB-R was well con- 
ceived and executed. The norm group consisted of 
2,182 persons sampled to reflect the 1990 census 
characteristics. The normative data cover persons 
from age 3 months to adults over age 80. An addi- 
tional sample of persons with mental retardation, 
learning or hearing disabilities, and behavior dis- 
orders was also tested. The value of the SIB-R was 
further strengthened by anchoring it to the norms 


for the Woodcock-Johnson Psycho-Educational 
Battery-Revised. The SIB-R is one component of 
this larger test battery, but can be used on its own. 

The reliability of the SIB-R is generally re- 
spectable, but somewhat variable from subscale to 
subscale and from one age group to another. The 
individual subscales tend to show split-half relia- 
bilities in the vicinity of 0.80; the four clusters have 
median composite reliabilities around 0.90; the 
Broad Independence Scale has a very robust reliability
in the high .90s (Bruininks, Woodcock,
Weatherman, & Hill, 1996). 
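
For readers who want to see how a split-half coefficient is obtained, the following sketch splits each examinee's item scores into odd and even halves, correlates the two half scores, and applies the Spearman-Brown correction. Whether the SIB-R authors used exactly this odd-even split and correction is not stated here, so treat both as assumptions; the function names are our own.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_matrix):
    """item_matrix: one list of item scores per examinee."""
    odd_half = [sum(row[0::2]) for row in item_matrix]    # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_matrix]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))
```

As an illustration of the correction alone, a correlation of .50 between the two halves yields an estimated full-length reliability of about .67.
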

Initial validity data for the SIB-R are very
promising. For example, the mean scores of various 
samples of disabled and nondisabled subjects show 
confirmatory relationships: SIB-R scores are low- 
est among those persons known to be most severely 
impaired in learning and adjustment. For disabled 
examinees, SIB-R scores correlate very strongly 
with intelligence scores (in the .80s), whereas with 
nondisabled examinees, the relationship is mini- 
mal (Bruininks et al., 1996). 

In sum, the SIB-R is an excellent tool for pro- 
viding insights into an examinee’s current level of 
functioning in real-life situations in the home, 
school, and community settings. Although this in- 
strument does not have a one-to-one correspon- 
dence with the 10 areas of adaptive skill listed in 
the definition of mental retardation, there is sub- 
stantial similarity. For example, the following areas 
of AAMR-listed adaptive skills are well covered by 
subscales or clusters of the SIB-R: communication, 
self-care, home living, social skills, community 
use, health and safety, and work. The SIB-R or a 
similar instrument ranks as a mandatory supple- 
ment to individual intelligence testing in the diag- 
nosis and assessment of mental retardation. 


Independent Living Behavior 
Checklist (ILBC) 


The Independent Living Behavior Checklist 
(ILBC) is an extensive list of 343 independent liv- 
ing skills classified and presented in six categories: 
mobility, self-care, home maintenance and safety, 
food, social and communication, and functional 
academic (Walls, Zane, & Thvedt, 1979). Unlike 
most of the instruments discussed so far in this text, 
the ILBC is completely nonnormative. The sole 
purpose of the ILBC is to facilitate the training of 
the individual examinee in the skills required for 
independent living. For this purpose, a collection 
of carefully selected criterion-referenced skills 
works better than a group of norm-based scores. 
The ILBC focuses on what the examinee can do, 
not on how the examinee compares to other persons.
An exact age range is not specified, but the
instrument appears to be suitable for persons 16
years of age through adulthood. 

For each skill, the ILBC specifies a condition, 
a behavior, and a standard. Table 7.3 lists a sample 
of ILBC items. The reader will notice that all three 
components (condition, behavior, and standard) are 
defined with enough precision that reasonable ob- 
servers would likely agree when a skill has been 
mastered. In fact, test-retest and interobserver 
agreement for ILBC skills range from .96 to a per- 
fect 1.00. 
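
One simple way to quantify agreement on a mastery checklist is the proportion of skills on which two observers give the same judgment. The sketch below illustrates that index under the assumption that each skill is rated as mastered or not mastered; the source does not say which agreement statistic the ILBC authors computed, and the function name and ratings are hypothetical.

```python
def proportion_agreement(ratings_a, ratings_b):
    """Proportion of skills on which two observers made the same judgment."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


# Hypothetical mastery judgments (True = mastered) on 25 skills
observer_1 = [True] * 20 + [False] * 5
observer_2 = [True] * 20 + [False] * 4 + [True]   # differs on one skill
agreement = proportion_agreement(observer_1, observer_2)
```

In this made-up case the observers agree on 24 of 25 skills, an agreement of .96.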

The items within each ILBC category were 
carefully selected to encompass the important and 
relevant skills for independent living. Apparently, 
the authors succeeded in identifying essential 
skills, insofar as their instrument has a 100 percent 
overlap with another—initially unknown—check- 
list for independent living (Schwab, 1979). In ad- 
dition, the ILBC items were carefully ordered from 
easiest to hardest. When used on a continuing basis 
over a several-year training period, the ILBC thus 
provides a checklist of skills mastered and also fur- 
nishes guidance for further rehabilitation. 


Additional Measures of Adaptive Behavior 


We remind the reader that measures of adaptive be- 
havior vary greatly. Some scales are designed 
mainly for diagnosis, others for remediation. Some 
scales are useful with persons with severe and pro- 
found mental retardation who will never be em- 
ployed, others with individuals with mild mental 
retardation seeking vocational training. Some 
scales are useful exclusively with children, others 
with adults. These instruments are not inter- 
changeable, and the potential user must study their 
strengths and limitations carefully. 

The Vineland Adaptive Behavior Scales
(VABS; Sparrow, Balla, & Cicchetti, 1984) is the 
most widely used measure of adaptive behavior in 
existence. The instrument is the outcome of a major 
revision and restandardization of the Vineland So- 
cial Maturity Scale, originally published in 1935 by 
Edgar A. Doll. Based upon a semistructured inter- 
view with a caregiver or parent, the VABS provides 


TABLE 7.3 A Sampling of ILBC Items

Rubber Scraper 35
Condition: Given a bowl containing ingredients, a pan, and a rubber scraper
Behavior: Client pours the ingredients into the pan and scrapes the sides of the bowl
Standard: Behavior within 2 minutes. No ingredients must be spilled. All ingredients must be removed from the bowl

Compliments 30
Condition: Given a role play or natural situation in which the client is complimented
Behavior: Client accepts the compliment(s) (e.g., says “Thank you.”)
Standard: In the role play or natural situation, all persons interviewed must independently state that the client accepted the compliment(s) politely and was not overly gracious or vain

Address 38
Condition: Given a piece of paper with an address of place located within 3 blocks of the client
Behavior: Client finds the appropriate location with or without assistance
Standard: Behavior within one hour. The appropriate location must be found. The location may be found by the client alone or by the client with assistance (e.g., asking directions from others such as a policeman)





Source: Reprinted with permission from Walls, R. T., Zane, T., & Thvedt, J. E. (1979). The Independent 
Living Behavior Checklist. Dunbar: West Virginia Research and Training Center. 


an evaluation in the following domains and subdo- 
mains: Communication (receptive, expressive, 
written), Daily Living Skills (personal, domestic, 
community), Socialization (interpersonal relation- 
ships, play and leisure time, coping skills), Motor 
Skills (gross, fine). 

The VABS is a widely respected instrument 
with good concurrent validity, including correla- 
tions in the range of .50 to .80 with the WISC-R 
and Stanford-Binet. However, some of the inter- 
view items require knowledge that the informants 
may not possess (e.g., whether a child says 100 
recognizable words). Silverstein (1986) faults the 
normative data, noting discontinuous jumps in 
standard scores from one age group to another. 
Even so, the Vineland continues to be a highly pop- 
ular test in clinical practice and research. 

The American Association on Mental Retarda- 
tion (AAMR) has developed several scales useful 
in the assessment of persons with cognitive limitations.
We mention here just one of its products, the
AAMR Adaptive Behavior Scales: Second Edition 
(Nihira, Leland, & Lambert, 1993). The residential 
and community version of this test, suitable for per- 
sons 18 to 80 years of age, is a psychometric tour 
de force that borders on overkill. The normative 
sample includes more than 4,000 persons with de- 
velopmental disabilities from 43 states, residing in 
the community or in residential settings. In addi- 
tion to assessing the appropriate behavioral do- 
mains (e.g., independent functioning, domestic 
activity, self-direction, responsibility), a notewor- 
thy feature of the instrument is the careful attention 
to maladaptive behaviors, which are evaluated in 
eight domains: 


• Violent and antisocial behavior
• Rebellious behavior
• Eccentric and self-abusive behavior
• Untrustworthy behavior
• Withdrawal
• Stereotyped and hyperactive behavior
• Inappropriate body exposure
• Disturbed behavior


This scale has been extensively validated and 
clearly distinguishes persons independently classi- 
fied at different adaptive behavior levels. 


SUMMARY


1. In the 1970s, a renewed societal concern 
for the needs of persons with disabilities was trans- 
lated into federal legislation. Public Law 93-112 
outlawed discrimination on the basis of disability. 
Public Law 94-142 mandated that disabled school- 
children receive appropriate assessment and edu- 
cational opportunities. 


2. Public Law 94-142, the Education for All 
Handicapped Children Act, has had dramatic 
impacts upon assessment practices with disabled 
persons. The law specifies nondiscriminatory as- 
sessment, native language testing, validated testing 
by appropriately trained personnel, and ancillary 
procedures, such as health, hearing, vision, and 
emotional assessment. 


3. Public Law 99-457 requires states to pro- 
vide a free appropriate public education to disabled 
children ages 3 through 5. Furthermore, the law 
mandates financial grants for states that serve dis- 
abled infants, toddlers, and their families. 


4. The Americans with Disabilities Act 
(ADA) of 1990 mandates that agencies must make 
reasonable testing accommodations for persons 
with disabilities, including such provisions as in- 
creased time on timed tests for persons with learn- 
ing disabilities and related disorders. 


5. The Leiter International Performance 
Scale-Revised is an untimed measure of perceptual 
organization and reasoning ability. The test can be 
administered completely by pantomime—the ex- 
aminee matches small laminated cards underneath 
corresponding illustrations on an easel display. 


6. The Goodenough-Harris Drawing Test is 
a brief screening test of intelligence in which the ex- 
aminee is encouraged to draw a good picture of a 
man. The 73 scorable items include body parts, details,
perspective, proportion, and implied freedom
of movement. The Draw A Person test of Naglieri 
(1988) is an updated version of the Drawing Test. 


7. The Hiskey-Nebraska Test of Learning Ap- 
titude is a nonlanguage performance scale for use 
with children ages 3 to 17 years. The test is used 
with children who are deaf or bilingual or those 
who have speech or language impairment or men- 
tal retardation. Originally normed in 1960, the test 
is in need of restandardization. 


8. The Test of Nonverbal Intelligence-3 
(TONI-3) is a language-free multiple-choice mea- 
sure of cognitive ability designed for special popu- 
lations and carefully standardized for ages 5 
through 85. Most items require the examinee to 
identify relationships among abstract figures. The 
TONI-3 is a good index of general—as opposed to 
nonverbal—intelligence. 


9. The Peabody Picture Vocabulary Test-III 
(PPVT-III) is suitable for obtaining a rapid measure 
of hearing vocabulary with persons who are deaf or 
disabled (e.g., from stroke or cerebral palsy). The 
examiner says a word and the examinee tries to se- 
lect from four pictures the one that depicts the 
word. 


10. Respected tests for subjects with visual im- 
pairment include the Perkins-Binet, an adaptation 
of the Stanford-Binet; the Haptic Intelligence Scale 
for the Adult Blind (HISAB), a modification of the 
Wechsler Performance subtests; and the Blind 
Learning Aptitude Test (BLAT), a Braille-like mea- 
sure of concept formation and abstract reasoning. 


11. Testing persons who are deaf, especially 
those who rely upon sign language, requires specialized
training and sensitivity to issues of Deaf culture.
The Performance subtests of the Wechsler
scales remain the tools of choice. A formal ASL
translation of the WAIS-III (with videotaped demon- 
stration) has been released by the test publisher. 


12. Mental retardation is defined by three cri- 
teria: significantly subaverage general intellectual 
functioning, typically defined as an IQ under 70 (or 
75 in exceptional cases); limitations in two or more 
adaptive skill areas; and onset prior to the eigh- 
teenth birthday. 


13. The Scales of Independent Behavior-Revised
(SIB-R) is a measure of adaptive behavior that is 
highly useful in the assessment of mental retarda- 
tion. A parent, caregiver, or teacher completes a se- 
ries of 14 subscales pertaining to motor skills, 
social and communication skills, personal living 
skills, and community living skills. 


KEY TERMS AND CONCEPTS 


Public Law 93-112
Public Law 94-142
Public Law 99-457
Americans with Disabilities Act
legally blind
mental retardation


TOPIC 7B Test Bias and Other Controversies


The Question of Test Bias 
Social Values and Test Fairness 

Genetic and Environmental Determinants of Intelligence 
Origins of African American and White IQ Differences 
Age Changes in Intelligence 
Generational Changes in Intelligence Test Scores 


Summary 
Key Terms and Concepts 


An intelligence test is a neutral, inconsequential
tool until someone assigns significance
to the results derived from it. Once meaning is at- 
tached to a person’s test score, that individual will 
experience many repercussions, ranging from su- 
perficial to life-changing. These repercussions will 
be fair or prejudiced, helpful or harmful, appropri- 
ate or misguided—depending upon the meaning at- 
tached to the test score. 

Unfortunately, the tendency to imbue intelli- 
gence test scores with inaccurate and unwarranted 
connotations is rampant. Laypersons and students 
of psychology commonly stray into one thicket of 
harmful misconceptions after another. Test results 
are variously overinterpreted or underinterpreted, 
viewed by some as a divination of personal worth 
but devalued by others as trivial and unfair. 

The purpose of this topic is to further clarify the 
meaning of intelligence test scores in the light of 
relevant behavioral research. Specifically, we will 
pursue five issues—some would say controver- 
sies—that bear on the meaning of intelligence test 
scores: 


• The question of test bias
• Genetic and environmental effects on intelligence
• Origins of IQ differences between African Americans and Caucasian Americans
• The fate of intelligence in middle and old age
• Generational changes in intelligence test scores

The underlying theme of this section is that intelligence
test scores are best understood within the
framework of modern psychological research. The 
reader is warned that the research issues pursued 
here are complex, confusing, and occasionally con- 
tradictory. However, the rewards for grappling with 
these topics are substantial. After all, the meaning 
of intelligence tests is demarcated, sharpened, and 
refined entirely by empirical research. 


THE QUESTION OF TEST BIAS


Beyond a doubt, no practice in modern psychology 
has been more assailed than psychological testing. 
Commentators reserve a special and often vehe- 
ment condemnation for ability testing in particular. 
In his wide-ranging response to the hundreds of 
criticisms aimed at mental testing, Jensen (1980) 
concluded that test bias is the most common rally- 
ing point for the critics. In proclaiming test bias, 
the skeptics assert in various ways that tests are cul- 
turally and sexually biased so as to discriminate un- 
fairly against racial and ethnic minorities, women, 
and the poor. We cite here a sampling of verbatim 
criticisms (Jensen, 1980): 


• Intelligence tests are sadly misnamed because
they were never intended to measure intelligence
and might have been more aptly called CB (cultural
background) tests.

• Persons from backgrounds other than the culture
in which the test was developed will always be
penalized.

• There are enormous social class differences in a
child’s access to the experiences necessary to acquire
the valid intellectual skills.

• IQ scores reported for African Americans and
low socioeconomic groups in the United States
reflect characteristics of the test rather than of the
test takers.

• The poor performance of African American children
on conventional tests is due to the biased
content of the tests; that is, the test material
is drawn from outside the African American
culture.

• Women are not so good as men at mathematics
only because women have not taken as much
math in high school and college.


Are these criticisms valid? The investigation of 
this question turns out to be considerably more 
complicated than the reader might suppose. A most 
important point is that appearances can be deceiv- 
ing. As we will explain subsequently, the fact that 
test items “look” or “feel” preferential to one race, 
sex, or social class does not constitute proof of test 
bias. Test bias is an objective, empirical question, 
not a matter of personal judgment. 

Although critics may be loath to admit it, dis- 
passionate and objective methods for investigating 
test bias do exist. One purpose of this section is to 
present these methods to the reader. However, an 
aseptic discussion of regression equations and sta- 
tistical definitions of test bias would be incomplete, 
only half of the story. Conceptions of test bias are 
irretrievably intermingled with notions of test fair- 
ness. A full explanation of the story surrounding the 
test-bias controversy requires that we investigate 
the related issue of test fairness, too. 

Differences in terminology abound in this area, 
so it is important to set forth certain fundamental 
distinctions before proceeding. Test bias is a tech- 
nical concept amenable to impartial analysis. The 
most salient methods for the objective assessment 
of test bias are discussed in the following. In contrast,
test fairness reflects social values and philosophies
of test use, particularly when test use extends
to selection for privilege or employment. Much of
the passion that surrounds the test-bias controversy 
stems from a failure to distinguish test bias from 
test fairness. To avoid confusion, it is crucial to 
draw a sharp distinction between these two con- 
cepts. We include separate discussions of test bias 
and test fairness, beginning with an analysis of why 
test bias is such a controversial topic. 


The Test-Bias Controversy 


The test-bias controversy has its origins in the ob- 
served differences in average IQ among various 
racial and ethnic groups. For example, African 
Americans score, on average, about 15 points lower 
than white Americans on standardized IQ tests. 
This difference reduces to 7 to 12 IQ points when 
socioeconomic disparities are taken into account. 
The existence of marked racial/ethnic differences 
in ability test scores has fanned the fires of contro- 
versy over test bias. After all, employment oppor- 
tunities, admission to college, completion of a 
high-school diploma, and assignment to special ed- 
ucation classes are all governed, in part, by test re- 
sults. Biased tests could perpetuate a legacy of 
racial discrimination. Test bias is deservedly a topic 
of intense scrutiny by both the public and the test- 
ing professions. 

One possibility is that the observed IQ dispari- 
ties indicate test bias rather than meaningful group 
differences. In fact, most laypersons and even some 
psychologists would regard the magnitude of race 
differences in IQ as prima facie evidence that intel- 
ligence tests are culturally biased. This is an ap- 
pealing argument, but a large difference between 
defined subpopulations is not a sufficient basis for 
proving test bias. The proof of test bias must rest 
upon other criteria outlined in the following section. 

Racial and ethnic differences are not the only 
foundation for the test-bias controversy. Signifi- 
cant gender differences also exist on some ability 
measures, most particularly in the area of spatial 
thinking (Maccoby & Jacklin, 1974; Halpern, 
1986). In one study (Gregory, Alley, & Morris, 
1980), males outscored females on the spatial-
reasoning component of the Differential Aptitude
Test by a full standard deviation. Such findings
raise the possibility that spatial-reasoning tests 
may be biased in favor of males. But how can we 
know? When do test score differences between 
groups signify test bias? We begin by reviewing 
the criteria that should be used to investigate test 
bias of any kind, whether for race, gender, or any 
other defining characteristic. 


Criteria of Test Bias and Test Fairness 


The topic of test bias has received wide attention 
from measurement psychologists, test developers, 
journalists, test critics, legislators, and the courts. 
Cole and Moss (1998) underscore an unsettling 
consequence of the proliferation of views held on 
this topic, namely, that concepts of test bias have become
increasingly intricate and complex. Furthermore,
the understanding of test bias is made
difficult by the implicit and often emotional as- 
sumptions—held even by scholars—that may lead 
honest persons to view the same information in dif- 
ferent ways. 

In part, disagreements about test bias are per- 
petuated because adversaries in this debate fail to 
clarify essential terminology. Too often, terms such 
as test bias and test fairness are considered inter- 
changeable and thrown about loosely, without def- 
inition. We propose that test bias and test fairness 
commonly refer to markedly different aspects of 
the test-bias debate. Careful examination of both 
concepts will provide a basis for a more reasoned 
discussion of this controversial topic. 

As interpreted by most authorities in this field, 
test bias refers to objective statistical indices that 
examine the patterning of test scores for relevant 
subpopulations. Although experts might disagree 
about nuances, on the whole there is a consensus 
about the statistical criteria that indicate when a test 
is biased. We will expand this point later, but we 
can provide the reader with a brief preview here: In 
general, a test is deemed biased if it is differentially 
valid for different subgroups. For example, a test 
would be considered biased if the scores from ap- 
propriate subpopulations did not fall upon the same 
regression line for a relevant criterion. 


In contrast to the narrow concept of test bias, 
test fairness is a broad concept that recognizes the 
importance of social values in test usage. Even a 
test that is unbiased according to the traditional 
technical criteria of homogeneous regression might 
still be deemed unfair because of the social conse- 
quences of using it for selection decisions. The crux 
of the debate is this: Test bias (a statistical concept) 
is not necessarily the same thing as test fairness (a
values concept). Ultimately, test fairness is based 
on social conceptions such as one’s image of a just 
society. In the assessment of test fairness, subjec- 
tive values are of overarching importance; the sta- 
tistical criteria of test bias are merely ancillary. We 
will return to this point later when we analyze the 
link between social values and test fairness. But let 
us begin with a traditional presentation of technical 
criteria for test bias. 


The Technical Meaning 
of Test Bias: A Definition 


One useful way to examine test bias is from the 
technical perspective of test validation. The reader 
will recall from an earlier chapter that a test is valid 
when a variety of evidence supports its utility and 
when inferences derived from it are appropriate, 
meaningful, and useful. One implication of this 
viewpoint is that test bias can be equated with dif- 
ferential validity for different groups: 


Bias is present when a test score has meanings
or implications for a relevant, definable subgroup
of test takers that are different from the meanings
or implications for the remainder of the test
takers. Thus, bias is differential validity of a given
interpretation of a test score for any definable,
relevant subgroup of test takers. (Cole & Moss,
1998)


Perhaps a concrete example will help clarify this 
definition. Suppose a simple word problem arith- 
metic test were used to measure youngsters’ addi- 
tion skills. The problems might be of the form “If 
you have two six-packs of pop, how many cans do 
you have altogether?” Suppose, however, the test is 
used in a group of primarily Spanish-speaking seventh
graders. With these children, low scores might
indicate a language barrier, not a problem with
arithmetic skills. In contrast, for English-speaking 
children low scores would most likely indicate a 
deficit in arithmetic skills. In this example, the test 
has differential validity, predicting arithmetic 
deficits quite well for English-speaking children, 
but very poorly for Spanish-speaking children. According
to the technical perspective of test validation,
we would conclude that the test is biased.

Although the general definition of test bias
refers to differential validity, in practice the particu- 
lar criteria of test bias fall under three main head- 
ings: content validity, criterion-related validity, and 
construct validity. We will review each of these 
categories, discussing relevant findings along the 
way. The coverage is illustrative, not exhaustive. In- 
terested readers should consult Jensen (1980), Cole 
and Moss (1998), and Reynolds and Brown (1984b). 


Bias in Content Validity 


Bias in content validity is probably the most com- 
mon criticism of those who denounce the use of 
standardized tests with minorities (Helms, 1992; 
Hilliard, 1984; Kwate, 2001). Typically, critics rely 
upon their own expert judgment when they ex- 
pound one or more of the following criticisms of 
the content validity of ability tests: 


1. The items ask for information that ethnic mi- 
nority or disadvantaged persons have not had 
equal opportunity to learn. 

2. The scoring of the items is improper, since the 
test author has arbitrarily decided on the only 
correct answer and ethnic minorities are inap- 
propriately penalized for giving answers that 
would be correct in their own culture but not that 
of the test maker. 

3. The wording of the questions is unfamiliar, and 
an ethnic minority person who may “know” the 
correct answer may not be able to respond be- 
cause he or she does not understand the question 
(Reynolds, 1998). 


Any of these criticisms, if accurate, would consti- 
tute bona fide evidence of test bias. However, 
merely stating a criticism does not constitute proof.
Where these criticisms fall short is that they are seldom
buttressed by empirical evidence.

Reynolds (1998) has offered a definition of con- 
tent bias for aptitude tests that addresses the pre- 
ceding points in empirically defined, testable terms: 


An item or subscale of a test is considered to be 
biased in content when it is demonstrated to be rel- 
atively more difficult for members of one group 
than another when the general ability level of the 
groups being compared is held constant and no rea- 
sonable theoretical rationale exists to explain group 
differences on the item (or subscale) in question. 


This definition is useful because it proposes an
empirical approach to the question of test bias.
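
One way to make Reynolds's definition concrete is to hold ability
roughly constant by grouping examinees into total-score bands and
then comparing pass rates on a single item across groups within each
band. The sketch below is only an illustration under that assumption:
the function name `pass_rate_gaps`, the band width, the group labels,
and the sample records are all hypothetical, and the text does not
prescribe this particular procedure.

```python
from collections import defaultdict


def pass_rate_gaps(records, item, band_width=10):
    """Compare pass rates on one item for groups A and B within score bands.

    records: list of dicts with keys 'group', 'total', and the item name,
             where the item value is 1 (passed) or 0 (failed).
    Returns {band_start: rate_A - rate_B} for bands that contain both groups.
    """
    # band -> group -> [number passing, number tested]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for rec in records:
        band = (rec["total"] // band_width) * band_width
        cell = counts[band][rec["group"]]
        cell[0] += rec[item]
        cell[1] += 1

    gaps = {}
    for band, by_group in sorted(counts.items()):
        if "A" in by_group and "B" in by_group:
            rate_a = by_group["A"][0] / by_group["A"][1]
            rate_b = by_group["B"][0] / by_group["B"][1]
            gaps[band] = rate_a - rate_b
    return gaps


# Hypothetical records: total test score plus the result on one item.
records = [
    {"group": "A", "total": 42, "item7": 1},
    {"group": "A", "total": 47, "item7": 1},
    {"group": "B", "total": 44, "item7": 0},
    {"group": "B", "total": 49, "item7": 1},
    {"group": "A", "total": 55, "item7": 1},
    {"group": "B", "total": 58, "item7": 1},
]
gaps = pass_rate_gaps(records, "item7")
```

A gap near zero in every band is what the definition treats as
unbiased content. A consistent gap in one direction would flag the
item for closer review, provided no reasonable theoretical rationale
explains the difference.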

In general, attempts to prove that expert-nomi- 
nated items are culturally biased have not yielded 
the conclusive evidence that critics expect. McGurk 
(1953a, 1953b, 1967, 1975) has written extensively 
on this topic, and we will use his classic study to il- 
lustrate this point. For his doctoral dissertation, 
McGurk asked a panel of 78 judges (professors, ed- 
ucators, and graduate students in psychology and 
sociology) to classify each of 226 items from well- 
known standardized tests of intelligence into one of 
three categories: least cultural, neutral, most cul- 
tural. McGurk administered these test items to hun- 
dreds of high school students. His primary analysis 
involved the test results for 213 African American 
students and 213 white students matched for cur- 
riculum, school, length of enrollment, and socio- 
economic background. 

McGurk (1953a, 1953b) discovered that the 
mean difference between African American and 
white students for the total hybrid test, expressed in 
standard deviation units, was .50. More pertinent to 
the topic of test bias in content validity was his com- 
parison of scores on the 37 “most cultural” items 
versus the 37 “least cultural” items. For the “most 
cultural” items—the ones nominated by the judges 
as highly culturally biased—the difference was .30. 
For the “least cultural” items—the ones judged to be 
more fair to African Americans and other cultural 
minorities—the difference was .58. In other words, 
the items nominated as most cultural were relatively 
easier for African Americans; the items nominated 
as least cultural were relatively harder. This finding
held true even after item difficulty was partialed out.
Furthermore, the item difficulties for the two groups 
were almost perfectly correlated (r = .98 for “most 
cultural” and r = .96 for “least cultural” items). 
There is an important lesson here that test critics 
often overlook: “Expert” judges cannot identify 
culturally biased test items based on an analysis of 
item characteristics. Recent studies continue to reaf- 
firm this conclusion (Reynolds, Lowe, & Saenz, 
1999). 

In general, with respect to well-known stan- 
dardized tests of ability and aptitude, research has 
not supported the popular belief that the specific 
content of test items is a source of cultural bias 
against minorities. This conclusion does not exon- 
erate these tests with respect to other criteria of test 
bias, discussed in the following sections. Further- 
more, we can point out that savvy test developers 
should be vigilant even to the impression of bias in 
test content, since the appearance of unfairness can 
affect public attitudes about psychological tests in 
quite tangible ways. 


Bias in Predictive or Criterion-Related Validity 


The prediction of future performance is one im- 
portant use of intelligence, ability, and aptitude 
tests. For this application of psychological testing, 
predictive validity is the most crucial form of va- 
lidity in relation to test bias. In general, an unbiased 
test will predict future performance equally well for 
persons from different subpopulations. For exam- 
ple, an unbiased scholastic aptitude test will predict 
future academic performance of African Americans 
and white Americans with near-identical accuracy. 

Reynolds (1998) offers a clear, direct definition 
of test bias with regard to criterion-related or pre- 
dictive validity bias: 


A test is considered biased with respect to predic- 
tive validity if the inference drawn from the test 
score is not made with the smallest feasible random 
error or if there is constant error in an inference or 
prediction as a function of membership in a particular
group.


This definition of test bias invokes what might be 
referred to as the criterion of homogeneous regression.
According to this viewpoint, a test is unbiased
if the results for all relevant subpopulations cluster 
equally well around a single regression line. In 
order to clarify this point, we need to introduce con- 
cepts relevant to simple regression. The discussion 
is modeled after Cleary, Humphreys, Kendrick, and 
Wesman (1975). 

Suppose we are using a scholastic aptitude test 
to predict first-year grade point average (GPA) in 
college. In the case of a simple regression analysis, 
prediction of future performance is made from an 
equation of the form: 


Y = bX + a


where Y is the predicted college GPA, X is the score
on the aptitude test, and b and a are constants de- 
rived from a statistical analysis of test scores and 
grades of prior students. We will not concern our- 
selves with how b and a are derived; the reader can 
find this information in any elementary statistics 
textbook. 

The values of b and a correspond to important 
aspects of the regression line—the straight line that 
facilitates the most accurate prediction of the crite- 
rion (college grades) from the predictor (aptitude 
score) (Figure 7.3). In particular, b corresponds to 
the slope of the line, with higher values of b indi- 
cating a steeper slope and more accurate prediction. 
The value of a depicts the intercept on the vertical 
axis. The units of measurement for b and a cannot 
be specified in advance because they depend upon 
the underlying scales used for X and Y. Notice 
in Figure 7.3 that the regression line is the refer- 
ence for predicting grades from observed aptitude 
score. 
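
Although the derivation of b and a is left to statistics textbooks,
the usual least-squares formulas are short. The following sketch
assumes ordinary least squares and uses made-up test scores and
grades; `fit_line` and `predict` are illustrative names, not functions
from any particular package.

```python
def fit_line(x, y):
    """Return (b, a) for the least-squares line Y = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept on the vertical axis
    return b, a


def predict(b, a, score):
    """Predicted criterion value for a new test score."""
    return b * score + a


# Hypothetical aptitude scores and first-year GPAs of prior students
scores = [400, 500, 600, 700]
gpas = [2.0, 2.5, 3.0, 3.5]

b, a = fit_line(scores, gpas)
new_student_gpa = predict(b, a, 550)
```

With these invented data the points fall exactly on a line, so the
fitted slope is 0.005 grade points per test-score point and the
intercept is effectively zero. Real data scatter around the line.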

According to the criterion of homogeneous re- 
gression, in an unbiased test a single regression line 
can predict performance equally well for all rele- 
vant subpopulations, even though the means for the 
different groups might differ. For example, in Fig- 
ure 7.4 group A performs better than group B on 
both predictor and criterion. Yet, the relationship 
between aptitude score and grades is the same for 
both groups. In this hypothetical instance, the graph 
depicts the absence of bias on the aptitude test with 
respect to criterion-related validity.

FIGURE 7.3 Test Scores, Grades, and Regression Line for a
Hypothetical Large Group of College Students
Note: The dotted line shows how the regression line can be used
to predict grade point average from the test score for a single,
new subject.

A more complicated situation known as intercept
bias is shown in Figure 7.5. In this case, scores
for the two groups do not cluster tightly around
the single best regression line shown as a dotted
line in the graph. Separate, parallel regression lines
(and therefore separate regression equations)
would be needed to facilitate accurate prediction. If
a single regression line were used (the dotted
line), criterion scores for group A would be over-
predicted, whereas criterion scores for group B
would be underpredicted. Thus, the use of a single
regression line would constitute a clear instance of
test bias, because the test has differential predictive
validity for different subgroups.¹

FIGURE 7.4 Test Scores, Grades, and Single Regression Line for
Two Hypothetical Large Subpopulations of College Students

FIGURE 7.5 Test Scores, Grades, and Parallel Regression Lines for
Two Hypothetical Large Subpopulations of College Students

But what about using separate regression lines 
for each subgroup? Would this solve the problem 
and rescue the test from criterion-related test bias? 
Opinions differ on this point. Although there is no 
doubt that separate regression equations would 
maximize predictive accuracy for the combined 
sample, whether this practice would produce test 
fairness is debated. We return to this issue later, 
when we discuss the relevance of social values to
test fairness. 

1. Contrary to widely held belief, test bias in these cases actually
favors the lower-scoring group because its performance on
the criterion is overpredicted. On occasion, then, test bias can
favor minority groups.

The Scholastic Aptitude Test (now known as the
Scholastic Assessment Test and discussed in a later
chapter) has been analyzed by several researchers
with regard to test bias in criterion-related validity
(Cleary, Humphreys, Kendrick, & Wesman, 1975;
Breland, 1979; Manning & Jackson, 1984). A consistent
finding is that separate, parallel regression
lines are needed for African American and white
examinees. For example, in one school the best
regression equations for African American, white,
and combined students were as follows:


African American: Y = .055 + .0024V + .0025M
White:            Y = .652 + .0026V + .0011M
Combined:         Y = .586 + .0027V + .0012M


where Y is the predicted college grade point, V is 
the SAT Verbal score, and M is the SAT Mathe- 
matics score (Cleary et al., 1975, p. 29). The effect 
of using the white or the combined formula is to 
overpredict college grades for African American 
subjects based on SAT results. On the traditional 
four-point scale (A = 4, B = 3, etc.), the average 
amount of overprediction from 17 separate studies 
was .20 or one-fifth of a grade point (Manning & 
Jackson, 1984). What these results mean is open to 
debate, but it seems clear, at least, that the SAT and 
similar entrance examinations do not underpredict 
college grades for minorities. 
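
To see what overprediction means in practice, the three equations
above can be applied to the same score profile. The sketch below
copies the coefficients reported in the text; the function names and
the example profile (Verbal 400, Mathematics 400) are assumptions
chosen only for illustration.

```python
# (intercept, weight for Verbal, weight for Mathematics), as reported
# in the text from Cleary et al. (1975, p. 29)
EQUATIONS = {
    "african_american": (0.055, 0.0024, 0.0025),
    "white": (0.652, 0.0026, 0.0011),
    "combined": (0.586, 0.0027, 0.0012),
}


def predicted_gpa(equation, verbal, math):
    """Predicted college grade point from SAT Verbal and Mathematics."""
    intercept, b_verbal, b_math = EQUATIONS[equation]
    return intercept + b_verbal * verbal + b_math * math


def overprediction(verbal, math, reference="combined"):
    """Reference-equation prediction minus the group's own prediction."""
    own = predicted_gpa("african_american", verbal, math)
    return predicted_gpa(reference, verbal, math) - own


# Hypothetical score profile
gap = overprediction(400, 400)
```

For this hypothetical profile the combined equation predicts a grade
point about 0.13 higher than the group-specific equation. Because the
weights differ slightly from one equation to the next, the size of
the gap depends on the score profile.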

The most peculiar regression outcome, known 
as slope bias, is depicted in Figure 7.6. In this case, 
the regression lines for separate subgroups are not 
even parallel. Using a single regression line (the dot- 
ted line) for prediction might therefore result in both 
under- and overprediction of scores for selected subjects
in both groups. Professional opinion would be
unanimous in this case: This test possesses a high
degree of test bias in criterion-related validity. 
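
The three regression patterns (homogeneous regression, intercept
bias, and slope bias) can be told apart by fitting a separate line to
each group and comparing slopes and intercepts. The sketch below is a
simplified illustration: the tolerance values are arbitrary stand-ins
for the significance tests a real analysis would use, and the data
and function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope b and intercept a for Y = bX + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def regression_pattern(group_1, group_2, slope_tol=0.05, intercept_tol=0.05):
    """Label the pattern shown by two groups' regression lines.

    Each group is a (test_scores, criterion_scores) pair. The tolerances
    are arbitrary illustration values, not statistical tests.
    """
    b1, a1 = fit_line(*group_1)
    b2, a2 = fit_line(*group_2)
    if abs(b1 - b2) > slope_tol:
        return "slope bias"          # lines are not parallel
    if abs(a1 - a2) > intercept_tol:
        return "intercept bias"      # parallel lines, different intercepts
    return "homogeneous regression"  # one line serves both groups


# Hypothetical groups on an arbitrary scale
reference = ([1, 2, 3, 4], [2, 3, 4, 5])  # Y = 1X + 1
same_line = ([0, 1, 2], [1, 2, 3])        # Y = 1X + 1, lower mean
shifted = ([0, 1, 2], [0, 1, 2])          # Y = 1X + 0
steeper = ([0, 1, 2], [1, 3, 5])          # Y = 2X + 1
```

Note that `same_line` has a lower mean than `reference` on both
predictor and criterion yet shares its regression line, which is the
situation the criterion of homogeneous regression treats as unbiased.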


Bias in Construct Validity 


The reader will recall that the construct validity of 
a psychological test can be documented by diverse 
forms of evidence, including appropriate develop- 
mental patterns in test scores, theory-consistent in- 
tervention changes in test scores, and confirmatory 
factor analysis. Because construct validity is such 
a broad concept, the definition of bias in construct 
validity requires a general statement amenable to 
research from a variety of viewpoints with a broad 
range of methods. Reynolds (1998) offers the fol- 
lowing definition: 


Bias exists in regard to construct validity when a 
test is shown to measure different hypothetical traits 
(psychological constructs) for one group than for 
another; that is, differing interpretations of a com- 
mon performance are shown to be appropriate as a 
function of ethnicity, gender, or another variable of 
interest, one typically but not necessarily nominal. 


From a practical standpoint, two straightforward
criteria for nonbias flow from this definition
(Reynolds & Brown, 1984a). If a test is nonbiased,
then comparisons across relevant subpopulations
should reveal a high degree of similarity for (1) the
factorial structure of the test, and (2) the rank order
of item difficulties within the test. Let us examine
these criteria in more detail.

FIGURE 7.6 Test Scores, Grades, and Nonparallel Regression Lines
for Two Hypothetical Large Subpopulations of College Students

An essential criterion of nonbias is that the fac- 
tor structure of test scores should remain invariant 
across relevant subpopulations. Of course, even 
within the same subgroup, the factor structure of a 
test might differ between age groups, so it is im- 
portant that we restrict our comparison to same- 
aged persons from relevant subpopulations. For 
same-aged subjects, a nonbiased test will possess 
the same factor structure across subgroups. In par- 
ticular, for a nonbiased test the number of emergent 
factors and the factor loadings for items or subscales 
will be highly similar for relevant subpopulations. 

In general, when the items or subscales of prom- 
inent ability and aptitude tests are factor-analyzed 
separately in white and minority samples, the same 
factors emerge in the relevant subpopulations (Rey- 
nolds, 1982; Jensen, 1980, 1984). Although minor 
anomalies have been reported in a handful of stud- 
ies (Scheuneman, 1987; Reschly, 1978; Gutkin & 
Reynolds, 1981; Johnston & Bolen, 1984), research 
in this area is more notable for its consistent findings 
with respect to factorial invariance across subgroups.
An entirely typical result comparing white and Mexican
American children on the WISC-R (Geary &
Whitworth, 1988) is reported in Table 7.4.
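
The text does not name a specific index for judging whether factor
loadings are "highly similar" across groups. One commonly used
option, offered here only as an assumption, is Tucker's congruence
coefficient, which compares the loadings of the same subtests on the
same factor in two samples. The loadings below are invented for
illustration and are not taken from Table 7.4.

```python
import math


def congruence(loadings_1, loadings_2):
    """Tucker's congruence coefficient for two sets of factor loadings.

    Both lists must give loadings for the same subtests in the same order.
    Values close to 1.0 indicate a very similar loading pattern.
    """
    cross = sum(p * q for p, q in zip(loadings_1, loadings_2))
    norm = math.sqrt(
        sum(p * p for p in loadings_1) * sum(q * q for q in loadings_2)
    )
    return cross / norm


# Hypothetical loadings of three subtests on one factor in two samples
sample_1 = [0.80, 0.70, 0.60]
sample_2 = [0.75, 0.72, 0.55]

phi = congruence(sample_1, sample_2)
```

The coefficient would be computed once per factor. A test showing
factorial invariance would yield values near 1.0 for every factor.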

A second criterion of nonbias in construct va- 
lidity is that the rank order of item difficulties 
within a test should be highly similar for relevant 
subpopulations. Since age is a major determinant 
of item difficulty, this standard is usually checked 
separately for each age group covered by a test. The 
reader should note what this criterion does not 
specify. It does not specify that relevant subgroups 
must obtain equivalent passing rates for test items. 
What is essential is that the items that are the most 
difficult (or least difficult) for one subgroup should 
be the most difficult (or least difficult) for other rel- 
evant subpopulations. 

The criterion of similar rank order of item dif- 
ficulties can be tested in a very straightforward and 
objective manner. If the difficulty level of each item 
is computed by means of the p value (percentage 
passing) for each relevant subpopulation, then it is 
possible to compare the relative item difficulties
across same-aged subgroups. In fact, the similarity
of the rank order of item difficulties for any two
groups can be gauged objectively by means of a
correlation coefficient (r). The paired p values for
the test items constitute the values of x and y used 
in the computation. The closer the value of r to 
1.00, the more similar the rank ordering of item dif- 
ficulties for the two groups. 
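
The computation just described can be sketched in a few lines. The
item p values below are hypothetical; they are chosen so that one
group passes every item less often than the other while the ordering
of the items from easiest to hardest stays the same.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical p values (proportion passing) for the same five items
p_group_1 = [0.95, 0.85, 0.70, 0.50, 0.30]
p_group_2 = [0.90, 0.78, 0.66, 0.41, 0.22]

r = pearson_r(p_group_1, p_group_2)
```

Here r is very close to 1.00 even though the second group has a lower
passing rate on every item, which illustrates why the criterion
concerns the ordering of item difficulties and not equal passing
rates.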

In general, cross-group comparisons of relative 
item difficulties for prominent aptitude and ability 
tests have yielded correlations bordering on 1.00; 
that is, most tests show extremely similar rank or- 
derings for item difficulties across relevant sub- 
populations (Jensen, 1980; Reynolds, 1982). In a 
representative study, Miele (1979) investigated the 
relative item difficulties of the WISC for African 
American and white subjects at each of four grade 
levels (preschool, first, third, and fifth grades). He 
found that the average cross-racial correlations 
(holding grade level constant) for WISC item p val- 
ues was .96 for males and .95 for females. These 
values were hardly different from the cross-sex
correlations (holding grade level constant) within
race, which were .98 (whites) and .97 (African
Americans). As noted, these findings are not unusual.
In general, for prominent ability and aptitude
tests, the rank order of item difficulties is virtually
identical for all relevant subpopulations.


TABLE 7.4 Comparative Factor Analysis of the WISC-R for 100
Anglo-American and 78 Mexican American Children

Factor Loadings (Verbal Comprehension, Perceptual
Organization, Freedom from Distractibility)

Subtest           Ang    M-A
Information
Similarities             .85
Arithmetic        .73    .75
Vocabulary        .86    .86
Comprehension     .76    .76
Picture Comp.     .58    .58
Picture Arr.      .77    .61
Block Design      .72    .61
Object Assem.     .72    .72
Coding            .52    .52

Note: Ang = Anglo-American, N = 100. M-A = Mexican American, N = 78.

Source: Adapted with permission from Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of
the WISC-R different for Anglo- and Mexican-American children? Journal of Psychoeducational
Assessment, 6, 253-260.


Reprise on Test Bias 


Critics who hypothesize that tests are biased 
against minorities assert that the test scores under- 
estimate the ability of minority members. As we 
have argued in the preceding sections, the hypoth- 
esis of test bias is a scientific question that can 
be answered empirically through such procedures 
as factor analysis, regression equations, intergroup 
comparisons of the difficulty levels for “biased” 
versus “unbiased” items, and rank ordering of item 
difficulties. In general, ability and aptitude tests 
fare quite well by these criteria. In fact, there is no 
domain of ability or aptitude testing in which 
there has been cumulative evidence suggesting 
test bias. To the contrary, extensive reviews of 
the empirical studies provide overwhelming evi- 
dence disconfirming the bias hypothesis (Reynolds, 
1982, 1994a; Jensen, 1980; Manning & Jackson, 
1984). 

We turn now to the broader concept of test fair- 
ness. How well do existing instruments meet rea- 
sonable criteria of test fairness? As the reader will 
learn, test fairness involves social values and is 
therefore an altogether more debatable—and more 
debated—topic than test bias. 


SOCIAL VALUES AND TEST FAIRNESS


Even an unbiased test might still be deemed unfair 
because of the social consequences of using it 
for selection decisions. In contrast to the narrow, 
objective notion of test bias, the concept of test 
fairness incorporates social values and philoso- 
phies of test use. We will demonstrate to the reader 
that, in the final analysis, the proper application of 
psychological tests is essentially an ethical con- 
clusion that cannot be established on objective 
grounds alone. 


In a classic article that deserves detailed scru- 
tiny, Hunter and Schmidt (1976) proposed the first 
clear distinction between statistical definitions of 
test bias and social conceptions of test fairness. Al- 
though the authors reviewed the usual technical cri- 
teria of test bias with incisive precision, their article 
is most famous for its description of three mutually 
incompatible ethical positions that can and should 
affect test use. 

Hunter and Schmidt (1976) noted that psycho- 
logical tests are often used for institutional selec- 
tion procedures such as employment or college 
admission. In this context, the application of test re- 
sults must be guided by a philosophy of selection. 
Unfortunately, in many institutions the selection 
philosophy is implicit, not explicit. Nonetheless, 
when underlying values are made explicit, three 
ethical positions can be distinguished. These posi- 
tions are unqualified individualism, quotas, and 
qualified individualism. Since these ethical stances 
are at the very core of public concerns about test 
fairness, we will review these positions in some 
detail. 


Unqualified Individualism 


In the American tradition of free and open compe- 
tition, the ethical stance of unqualified individu- 
alism dictates that, without exception, the best 
qualified candidates should be selected for em- 
ployment, admission, or other privilege. Hunter and 
Schmidt (1976) spell out the implications of this 
position: 


Couched in the language of institutional selection 
procedures, this means that an organization should 
use whatever information it possesses to make a 
scientifically valid prediction of each individual’s 
performance and always select those with the 
highest predicted performance. This position looks
appealing at first glance, but embraces some impli- 
cations that most persons find troublesome. In par- 
ticular, if race, sex, or ethnic group membership 
contributed to valid prediction of performance in
a given situation over and above the contributions
of test scores, then those who espouse unqualified 
individualism would be ethically bound to use such 
a predictor.


Quotas


The ethical stance of quotas acknowledges that 
many bureaucracies and educational institutions owe 
their very existence to the city or state in which they 
function. Since they exist at the will of the people, it 
can be argued that these institutions are ethically 
bound to act in a manner that is “politically appro- 
priate” to their location. The logical consequence of 
this position is quotas. For example, in a location 
whose population is one-third African American and 
two-thirds white, selection procedures should admit 
candidates in approximately the same ratio. A selec- 
tion procedure that deviates consistently from this 
standard would be considered unfair. 

By definition, fair share quotas are based ini- 
tially upon population percentages. Within relevant 
subpopulations, factors that predict future perfor- 
mance such as test scores would then be consid- 
ered. However, one consequence of quotas is that 
those selected do not necessarily have the highest 
scores on the predictor test. 


Qualified Individualism 


Qualified individualism is a radical variant of in- 
dividualism: 


This position notes that America is constitutionally 
opposed to discrimination on the basis of race, reli- 
gion, national origin, or sex. A qualified individual- 
ist interprets this as an ethical imperative to refuse 
to use race, sex, and so on, as a predictor even if it 
were in fact scientifically valid to do so. (Hunter & 
Schmidt, 1976) 


For selection purposes, the qualified individu- 
alist would rely exclusively upon tested abilities, 
without reference to age, sex, race, or other demo- 
graphic characteristics. This seems laudable, but 
examine the potential consequences. Suppose a 
qualified individualist used SAT scores for pur- 
poses of college admission. Even though SAT 
scores for African Americans and whites produce 
separate regression lines for the criterion of college 
grades, the qualified individualist would be ethically
bound to use the single, less accurate regression
line derived for the entire sample of applicants.
As a consequence, the future performance of
African Americans would be overpredicted, which 
would seemingly boost the proportion of persons 
selected from this applicant group. With respect to 
selection ratios, the practical impact of qualified in- 
dividualism is therefore midway between quotas 
and unqualified individualism. 
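
The practical difference among the three positions can be illustrated
with a toy selection problem. Everything in the sketch below is
hypothetical: the applicants, the regression coefficients, the
population shares, and the function names. It assumes an
intercept-bias situation in which the two groups have parallel
regression lines and the pooled line falls between them.

```python
def predict(equation, score):
    """Predicted criterion performance from a (slope, intercept) pair."""
    slope, intercept = equation
    return slope * score + intercept


def top_k(applicants, k, key):
    """The k applicants ranked highest by the given key."""
    return sorted(applicants, key=key, reverse=True)[:k]


def unqualified_individualism(applicants, k, group_equations):
    # Uses every valid predictor, including group membership.
    return top_k(applicants, k,
                 lambda a: predict(group_equations[a["group"]], a["score"]))


def qualified_individualism(applicants, k, pooled_equation):
    # Uses the test score only, with one equation for everyone.
    return top_k(applicants, k, lambda a: predict(pooled_equation, a["score"]))


def quotas(applicants, k, population_shares):
    # Fixes each group's share first, then ranks by score within the group.
    selected = []
    for group, share in population_shares.items():
        members = [a for a in applicants if a["group"] == group]
        selected += top_k(members, round(k * share), lambda a: a["score"])
    return selected


def count(selected, group):
    return sum(1 for a in selected if a["group"] == group)


# Hypothetical applicants, coefficients, and population shares
applicants = [
    {"group": "A", "score": 700}, {"group": "A", "score": 650},
    {"group": "A", "score": 600}, {"group": "B", "score": 660},
    {"group": "B", "score": 640}, {"group": "B", "score": 590},
]
group_equations = {"A": (0.004, 0.3), "B": (0.004, 0.0)}  # parallel lines
pooled_equation = (0.004, 0.1)
shares = {"A": 1 / 3, "B": 2 / 3}

picks_unqualified = unqualified_individualism(applicants, 3, group_equations)
picks_qualified = qualified_individualism(applicants, 3, pooled_equation)
picks_quota = quotas(applicants, 3, shares)
```

With these invented numbers, three openings go to zero, one, and two
members of group B under unqualified individualism, qualified
individualism, and quotas, respectively, so qualified individualism
lands between the other two positions.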


Reprise on Test Fairness 


Which philosophy of selection is correct? The truth 
is, this problem is beyond the scope of rational so- 
lution. At one time or another, each of the ethical 
stances outlined previously has been championed 
by wise, respected, and thoughtful citizens. How- 
ever, no consensus has emerged, and one is not 
likely to be found soon.


The dispute reviewed here is typical of ethical
arguments—the resolution depends in part on
irreconcilable values. Furthermore, even among
those who agree on values there
will be disagreements about the validity of certain 
relevant scientific theories that are not yet ade- 
quately tested. Thus, we feel that there is no way 
that this dispute can be objectively resolved. Each 
person must choose as he sees fit (and in fact we 
are divided). (Hunter & Schmidt, 1976) 


When ethical stances clash—as they most certainly 
do in the application of psychological tests to se- 
lection decisions—the court system may become 
the final arbiter, as discussed later in this book. 


GENETIC AND ENVIRONMENTAL 
DETERMINANTS OF INTELLIGENCE 


Genetic Contributions to Intelligence 


The nature-nurture debate regarding intelligence is 
a well-known and overworked controversy that we 
will largely sidestep here. We concur with McGue, 
Bouchard, Iacono, and Lykken (1993) that a sub- 
stantial genetic component to intelligence has been 
proved by decades of adoption studies, familial re- 
search, and twin projects, even though individual 
studies may be faulted for particular reasons: 


When taken in aggregate, twin, family, and adop-
tion studies of IQ provide a demonstration of the
existence of genetic influences on IQ as good as
can be achieved in the behavioral sciences with 
nonexperimental methods. Without positing the ex- 
istence of genetic influences, it simply is not possi- 
ble to give a credible account for the consistently 
greater IQ similarity among monozygotic (MZ) 
twins than among like-sex dizygotic (DZ) twins, 
the significant IQ correlations among biological 
relatives even when they are reared apart, and the 
strong association between the magnitude of the fa- 
milial IQ correlation and the degree of genetic re- 
latedness. (p. 60) 


Of course, the demonstration of substantial genetic 
influence for a trait does not imply that heredity 
alone is responsible for differences between indi- 
viduals—environmental factors are formative, too, 
as reviewed subsequently. 

The genetic contribution to human characteris- 
tics such as intelligence (as measured by IQ tests) 
is usually measured in terms of a heritability index 
that can vary from 0.0 to 1.0. The heritability 
index is an estimate of how much of the total vari- 
ance in a given trait is due to genetic factors. Heri- 
tability of 0.0 means that genetic factors make no 
contribution to the variance in a trait, whereas her- 
itability of 1.0 means that genetic factors are ex- 
clusively responsible for the variance in a trait. Of 
course, for most measurable characteristics, heri- 
tability is somewhere between the two extremes. 
McGue et al. (1993) discuss the various methods 
for computing heritability based upon twin and 
adoption studies. 
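
As one illustration of how a heritability index can be computed from twin data, the sketch below uses Falconer's formula, a classical estimator that doubles the difference between the MZ and DZ twin correlations. The source does not say which methods McGue et al. (1993) cover, so the choice of this formula is an assumption made for illustration, and the correlations passed in are hypothetical.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate h^2 = 2 * (r_MZ - r_DZ), clipped to the range 0.0 to 1.0."""
    h2 = 2.0 * (r_mz - r_dz)
    return max(0.0, min(1.0, h2))


# Hypothetical twin correlations, chosen only to illustrate the arithmetic.
h2 = falconer_heritability(r_mz=0.85, r_dz=0.60)
```

With these assumed inputs the estimate is about .50. Like any heritability estimate, the result describes variance in a population and says nothing about an individual score.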

It is important to stress that heritability is a pop- 
ulation statistic that cannot be extended to explain 
an individual score. Furthermore, heritability for a 
given trait is not a constant. As Jensen (1969) notes, 
estimates of heritability “are specific to the popu- 
lation sampled, the point in time, how the mea- 
surements were made, and the particular test used 
to obtain the measurements.” For IQ, most studies 
report heritability estimates right around .50, mean- 
ing that about half of the variability in IQ scores is 
from genetic factors. However, for some studies, 
the heritability of IQ is much higher, in the .70s 
(Bouchard, 1994; Bouchard, Lykken, McGue, 
Segal, & Tellegen, 1990; Pedersen, Plomin, Nes- 
selroade, & McClearn, 1992). 


Dozens of studies could be cited to demonstrate 
the importance of genetic factors in the determina- 
tion of intelligence as measured by IQ tests. The 
relevant literature on this question is almost legion. 
For a basic initiation to the topic, the reader can 
consult Bouchard (1994) and Jensen (1998). 

A most fascinating demonstration of the ge- 
netic contribution to IQ is found in the Minnesota 
Study of Twins Reared Apart (Bouchard et al., 
1990). In this ongoing study, identical twins reared 
apart are reunited for extensive psychometric test- 
ing. Bouchard (1994) reports that the IQs of iden- 
tical twins reared apart correlate almost as highly 
as those of identical twins reared together, even 
though the twins reared apart often were exposed 
to different environmental conditions (in some 
cases, sharply contrasting environments). In sum, 
differences in environment appeared to cause very 
little divergence in the IQs of identical twin pairs 
reared apart. These findings strongly corroborate a 
genetic contribution to intelligence, with heritabil- 
ity estimated in the vicinity of .70. 

We can further illustrate the general thrust of re- 
search in this area with a classic study, Honzik’s 
(1957) reanalysis of the data from the adoption 
study of Skodak and Skeels (1949). What this 
analysis showed is that when adopted children are 
repeatedly tested with an instrument such as the 
Stanford-Binet, their intelligence correlates more 
and more closely with the educational attainment 
of the biological parents; by age 8 the correlation 
stabilizes at a value of approximately 0.35 (Figure 
7.7). Interestingly enough, this is about the same 
level of correlation found between children’s IQ 
and parent educational attainment for intact fami- 
lies. In contrast, Honzik (1957) determined that the 
intelligence of the adopted children correlated near 
zero with the educational attainment of their adop- 
tive parents. The educational attainment of adults 
is a good proxy for their intelligence—the best es- 
timate available under the circumstances. The re- 
sults indicated that the intelligence of adoptive 
children parallels the intelligence of the biological 
parents—even in their absence—but showed no re-
lationship to the intelligence of the adoptive par-
ents—even in their constant, day-to-day presence.
What do these findings mean? We know that the bi-
ological parents provided the genetic subsidy to
their children’s intelligence, whereas the adoptive 
parents furnished the environmental underpinnings. 
The dispassionate observer will find it difficult to 
avoid an obvious conclusion: Adoption studies such 
as reported by Honzik (1957) corroborate a ro- 
bust—but not exclusive—genetic contribution to 
intelligence. 

However, we must avoid the tendency to view 
any corpus of research in a simplistic either/or 
frame of mind. Even the most die-hard hereditari- 
ans acknowledge that a person’s intelligence is 
shaped also by the quality of experience. The cru- 
cial question is: To what extent can enriched or de- 
prived environments modify intelligence upward or 
downward from the genetically circumscribed po- 
tential? The reader is reminded that the genetic con- 
tribution to intelligence is indirect, most likely via 
the gene-coded physical structures of the brain and 
nervous system. Nonetheless, the brain is quite 
malleable in the face of environmental manipula-
tions, which can even alter its weight and the rich-
ness of neuronal networks (Greenough, Black, &
Wallace, 1987). How much can such environmen-
tal impacts sway intelligence as measured by IQ
tests? We will review several studies indicating that
environmental extremes help determine intellectual
outcome within a range of approximately 20 IQ
points, perhaps more.

[Figure 7.7] Note: Data for mothers are quite similar.


Environmental Effects: 
Impoverishment and Enrichment 


First, we examine the effects of environmental dis- 
advantage. Vernon (1979, chap. 9) has reviewed the 
early studies of severe deprivation, noting that chil- 
dren reared under conditions in which they received 
little or no human contacts can show striking im- 
provements in IQ—as much as 30 to 50 points— 
when transferred to a more normal environment. 
Yet, we must regard this body of research with 
some skepticism, owing to the typically exceptional 
conditions under which the initial tests were ad-
ministered. Can a meaningful test be administered
to 7-year-old children raised almost like animals 
(Koluchova, 1972)? 

Typical of this early research is the follow-up 
study by Skeels (1966) of 25 orphaned children orig- 
inally diagnosed as having mental retardation 
(Skeels & Dye, 1939). These children were first 
tested at approximately 1½ years of age when living
in a highly unstimulating orphanage. Thirteen of 
them were then transferred to another home where 
they received a great deal of supervised, doting at- 
tention from older girls with mental retardation. 
These children showed a considerable increase in IQ, 
whereas the 12 who remained behind decreased fur- 
ther in IQ. When traced at follow-up 26 years later, 
the 13 transferred cases were normal, self-supporting 
adults, or were married. The other subjects—the con-
trast group—were still institutionalized or in menial
jobs. The enriched group showed an average increase 
of 32 IQ points when retested with the Stanford- 
Binet, whereas the contrast group fell below their 
original scores. Even though we are disinclined to 
place much credence in the original IQ scores and 
might therefore quarrel with the exact magnitude of 
the change, the Skeels (1966) study surely indicates 
that the difference between a severely depriving early 
environment and a more normal one might account 
for perhaps 15 to 20 IQ points. 

A methodologically novel demonstration of en- 
vironmentally induced IQ deficit has been reported 
by Jensen (1977). He tested 653 white children and 
826 African American children from a small rural 
town in the southeastern part of Georgia. The work- 
ing hypothesis of his study was that older African 
American children should score lower than their 
younger siblings, owing to the cumulative, intellect- 
depressing effects of their bleak, profoundly de- 
prived environment. According to the cumulative 
deficit hypothesis, a consistent downward trend in 
IQ is a result of the cumulative effects of environ- 
mental disadvantages in factors related to mental de- 
velopment. In contrast, white children, who are less 
environmentally deprived, should not show a cumu- 
lative intellectual deficit as a linear function of age. 

All children were administered the California 
Test of Mental Maturity (CTMM, 1963 revision), a
standardized test of general intelligence, as part of
a state-mandated testing program. The CTMM 
yielded carefully standardized deviation IQ scores 
(national mean of 100, standard deviation of 15) 
computed separately from national norms for each 
grade level from kindergarten through grade 12. 
Jensen (1977) noted that the sampled populations, 
particularly the African American group, were not 
intended to be representative of the wider U.S. pop- 
ulation, white or African American: 


Blacks in the locality under study are probably as 
severely disadvantaged, educationally and econom- 
ically, as can be found anywhere in the United 
States. If an age decrement does not exist in this 
group, it would seem most doubtful that it could be 
found in any subpopulation within our borders. 


As predicted, older African American children 
scored lower than their younger brothers and sis- 
ters, the magnitude of the difference being directly 
related to the difference in age. In particular, 
African American children appeared to lose ap- 
proximately one IQ point a year, on average, be- 
tween the ages of 6 and 16, with the cumulative loss 
totalling 5 to 10 IQ points. The exact amount of the 
loss depends upon how we interpret some apparent 
sampling peculiarities in the data (Figure 7.8).
Furthermore, if we factor in the probable IQ deficit
that occurred between birth and age 5, we can sur-
mise that the overall effect of a depriving environ-
ment is substantially more than the 5- to 10-point
IQ decrement reported by Jensen (1977).

FIGURE 7.8 Average Difference in CTMM IQ
between Younger and Older African American Siblings
as a Function of the Difference in Age

Source: Based on data from Jensen, A. R. (1977). Cumulative
deficit in IQ of blacks in the rural south. Developmental Psychol-
ogy, 13, 184-191.

Scarr and Weinberg (1976, 1983) reversed the 
question probed by Jensen (1977), namely, they 
asked: What happens to their intelligence when 
African American children are adopted into the rel- 
atively enriched environment provided by econom- 
ically and educationally advantaged white 
families? As discussed later, it is well known that 
African American children reared by their own 
families obtain IQ scores that average about 15 
points below whites (Jensen, 1980). Some portion 
of this difference—perhaps all of it—is likely due 
to the many social, economic, and cultural differ- 
ences between the two groups. We put that issue 
aside for now. Instead, we pursue a related question 
that bears on the malleability of IQ: What differ- 
ence does it make when African American children 
are adopted into a more economically and educa- 
tionally advantaged environment? 

Scarr and Weinberg (1976, 1983) found that 130 
African American and interracial children adopted 
into upper-middle-class white families averaged a 
Full Scale IQ of 106 on the Stanford-Binet or the 
WISC, a full 6 points higher than the national aver- 
age and some 18 to 21 points higher than typically 
found with African American examinees. African
American children adopted early in life, before 1 
year of age, fared even better, with a mean IQ of 110. 
We can only wonder what the IQ scores would have 
been if the adoptions had taken place at birth, and if 
excellent prenatal care had been provided. This 
study indicates that when the early environment is 
optimal, IQ can be boosted by perhaps 20 points. 

In addition to enriching the psychological en- 
vironment, another approach to boosting IQ has 
been to enrich the nutritional status of children. An 
intriguing study by Schoenthaler, Amos, Eysenck, 
and others (1991) looked at the effects of controlled 
vitamin-mineral supplements on IQ in 615 high 
school students. Except that it employed an 
extremely tight experimental design with double-
blind random assignment, this study would be easy
to dismiss as a “crackpot” investigation of the im-
plausible hypothesis that 12 weeks of vitamin-
mineral enrichment will boost IQ. The researchers
found that compared to a placebo control group, 
children who received 100 percent of the daily rec- 
ommended vitamin/mineral supplements showed 
no incremental gains in Verbal IQ, but a significant 
average boost of 3.7 points in Performance IQ on 
the WISC-R. This intriguing result is badly in need 
of replication and extension by other researchers. 
Lynn (1993) reviews additional studies, including 
a reported 9-point IQ gain in nonverbal reasoning 
among normal 12- to 13-year-old British children 
who received a vitamin-mineral supplement over 
an eight-month period. 

Limitations of space prevent us from further de- 
tailed discussion of environmental effects on IQ. It 
is worth noting, though, that a huge literature has 
emerged from early intervention and enrichment- 
stimulation studies of children at risk for school 
failure and mental retardation (e.g., Barnett & 
Camilli, 2002; Ramey & Ramey, 1998; Spitz, 
1986). In general, these studies show that interven- 
tion and enrichment can boost IQ in children at risk 
for school failure and mental retardation. Summa- 
rizing four decades of research, Ramey and Ramey 
(1998) extracted six principles from the research on 
early intervention for at-risk children. They refer to 
these as “remarkable consistencies in the major 
findings” on intervention studies: 


1. Interventions that begin earlier (e.g., during in- 
fancy) and continue longer provide the best ben- 
efits to participating children. 

2. More-intensive interventions (e.g., number of 
visits per week) produce larger positive effects 
than less-intensive interventions. 

3. Direct enrichment experiences (e.g., working di- 
rectly with the kids) provide greater impact than 
indirect experiences. 

4. Programs with comprehensive services (e.g., 
multiple enhancements) produce greater posi- 
tive changes than those with a narrow focus. 

5. Some children (e.g., those with normal birth- 
weight) show greater benefits from participation 
than other children.

6. Initial positive benefits diminish over time if the
child’s environment does not encourage positive 
attitudes and continued learning. 


One concern about early intervention programs 
is their cost, which has been excessive for some of 
the demonstration projects. Skeptics wonder about 
the practicality and also the ultimate payoff of pro- 
viding extensive, broad-based, continuing interven- 
tion virtually from birth onward for the millions of 
children at risk for developmental problems. This is 
a realistic concern because “relatively few early 
intervention programs have received long-term 
follow-up” (Ramey & Ramey, 1998). Critics also 
wonder if the programs merely teach children how 
to take tests without affecting their underlying intel- 
ligence very much (Jensen, 1981). Finally, there is 
the issue of cultural congruence. Intervention pro- 
grams are mainly designed by white psychologists 
and then applied disproportionately to minority chil-
dren. This is a concern because programs need to be
culturally relevant and welcomed by the consumers, 
otherwise the interventions are doomed to failure. 


Teratogenic Effects on 
Intelligence and Development 


In normal prenatal development, the fetus is pro- 
tected from the external environment by the pla- 
centa, a vascular organ in the uterus through which 
the fetus is nourished. However, some substances 
known as teratogens cross the placental barrier and 
cause physical deformities in the fetus. Especially 
if the deformities involve the brain, teratogens may 
produce lifelong behavioral disorders, including 
low IQ and mental retardation. The list of potential 
teratogens is almost endless and includes prescrip- 
tion drugs, hormones, illicit drugs, smoking, alco- 
hol, radiation, toxic chemicals, and viral infections 
(Berk, 1989; Martin, 1994). We will briefly high- 
light the most prevalent and also the most pre- 
ventable teratogen of all, alcohol. 

Heavy drinking by pregnant women causes 
their offspring to be at very high risk for fetal alco-
hol syndrome (FAS), a specific cluster of abnor-
malities first described by Jones, Smith, Ulleland,
and Streissguth (1973). Intelligence is markedly
lower in children with FAS. When assessed in ado- 
lescence or adulthood, about half of all persons 
with this disorder score in the range of mental re- 
tardation on IQ tests (Olson, 1994). Prenatal expo- 
sure to alcohol is one of the leading known causes 
of mental retardation in the Western world. The 
defining criteria of FAS include the following: 


1. Prenatal and/or postnatal growth retardation— 
weight below the tenth percentile after correct- 
ing for gestational age 

2. Central nervous system dysfunction—skull or 
brain malformations, mild to moderate mental 
retardation, neurological abnormalities, and be- 
havior problems 

3. Facial dysmorphology—widely spaced eyes, 
short eyelid openings, small up-turned nose, thin 
upper lip, and minor ear deformities (Clarren & 
Smith, 1978; Sokol & Clarren, 1989) 


The full-blown FAS previously described oc- 
curs mainly in offspring of women alcoholics— 
those who ingest many drinks per occasion. With 
lower levels of drinking, a more muted manifesta- 
tion of the syndrome known as fetal alcohol effect 
may arise. A child with fetal alcohol effect typically 
has a normal physical appearance, but exhibits 
demonstrably impaired attentional capacities and is 
slower to respond in a reaction time paradigm 
(Streissguth, Martin, Barr, & Sandman, 1984). Fur- 
thermore, the effect is linear-dose-related; that is, 
there may be no safe level of drinking during preg- 
nancy (Streissguth, Bookstein, & Barr, 1996). For 
this reason, physicians now routinely advise women 
to abstain from alcohol during pregnancy. Nonethe- 
less, a conservative estimate for the incidence of 
FAS (mild to severe forms) in the Western world is 
1 per 1,000 live births, with most cases going undi- 
agnosed and unrecognized (Abel, 1995). Spohr and 
Steinhausen (1996) provide an excellent review of 
research on the FAS syndrome. 


Effects of Environmental Toxins on Intelligence 


Many industrial chemicals and by-products may 
impair the nervous system temporarily, or even
cause permanent damage that affects intelligence.
Examples include lead, mercury, manganese, ar- 
senic, thallium, tetra-ethyl lead, organic mercury 
compounds, methyl bromide, and carbon disul- 
phide (Lishman, 1997). Certainly, the most widely 
studied of these environmental toxins is lead, which 
we examine in modest detail here. 

Sources of human lead absorption include eat- 
ing of lead-pigmented paint chips by infants and 
toddlers; breathing of particulate lead from smelter 
emissions or automobile combustion of leaded 
gasoline; eating of food from lead-soldered cans or 
lead-glazed pottery; and the drinking of water that 
has passed through lead pipes. Because the human 
body excretes lead slowly, most citizens of the in- 
dustrialized world carry a lead burden substantially 
higher—perhaps 500 times higher—than known in 
pre-Roman times (Patterson, 1980). 

The hazards of high-level lead exposure are ac- 
knowledged by every medical and psychological 
researcher who has investigated this topic. High 
doses of lead are irrefutably linked to cerebral 
palsy, seizure disorders, blindness, mental retarda- 
tion, even death. The more important question per- 
tains to “asymptomatic” lead exposure: Can a level 
of absorption that is insufficient to cause obvious 
medical symptoms nonetheless produce a decre- 
ment in intellectual abilities? 

Research findings on this topic are complex and 
controversial. Using tooth lead from shed teeth of 
young children as their index of cumulative lead 
burden, Needleman and associates (1979) reported 
that “asymptomatic” lead exposure was associated 
with decrements in overall intelligence (about 4 IQ 
points) and lowered performance on verbal sub- 
tests, auditory and speech processing tests, and a 
reaction time measure of attention. These differ- 
ences persisted at follow-up 11 years later (Needle- 
man, Schell, Bellinger, Leviton, & Allred, 1990). 
Yet, using a similar study method, Smith, Delves, 
Lansdown, Clayton, and Graham (1983) found a 
nonsignificant effect from children’s lead exposure 
when social factors such as the parents’ level of ed- 
ucation and social status were controlled. 

In part, research findings on this topic are con-
tradictory because it is difficult to disentangle the
effects of lead from those of poverty, stress, poor
nutrition, and other confounding variables (Kauf- 
mann, 2001ab). Most likely, asymptomatic lead ex- 
posure has harmful effects upon the nervous system 
that translate to reduced intelligence, impaired at- 
tention, and a host of other undesirable behavioral 
consequences. Even in the absence of a scientific 
consensus on this point, prudence dictates that we 
should reduce lead exposure in humans to the low-
est levels possible.


ORIGINS OF AFRICAN AMERICAN
AND WHITE IQ DIFFERENCES


African American and White IQ Differences 





As previously noted, African Americans score, on 
the average, about 15 points lower than white Amer- 
icans on standardized IQ tests. This difference is not 
trivial, amounting to a full standard deviation. Al- 
though the IQ difference fluctuates from one analy- 
sis to the next—as small as 10 points in some studies 
but as large as 20 points in others—the disparity has 
been documented in numerous samples with a wide 
variety of tests. Controlling for social class reduces 
the difference but does not eliminate it. Furthermore, 
the magnitude of the difference apparently remained 
impervious to change throughout the mid- to late 
twentieth century (Jensen, 1980; Kennedy, Van de 
Riet, & White, 1963; Shuey, 1966). 

Figure 7.9 portrays an earlier study of IQ dif- 
ferences between African Americans and white 
Americans, based upon the 1960 edition of the 
Stanford-Binet. The reader will notice that, on av- 
erage, the white sample (M = 101.8) outscored the 
African American sample (M = 80.7) by slightly 
more than 20 IQ points. More recently, a similar 
pattern has emerged on the WAIS-R, with whites 
(M = 101.4) in the standardization sample outscor-
ing African Americans (M = 86.9) by 14½ IQ
points (Reynolds, Chastain, Kaufman, & McLean,
1987). The race difference also surfaces on the up-
dated Stanford-Binet: Fourth Edition, with whites
(M = 103.5) in the standardization sample outscor-
ing African Americans (M = 86.1) by nearly 17½
points (Thorndike, Hagen, & Sattler, 1986).

FIGURE 7.9 Stanford-Binet IQ Distributions for African Americans and for White Americans
Source: Reprinted with permission from Kennedy, W. A., Van de Riet, V., & White, J. C., Jr. (1963). A norma-
tive sample of intelligence and achievement of negro elementary school children in the southeast United States.
Monographs of the Society for Research in Child Development, 28(90), 68. Copyright © by The Society for Re-
search in Child Development, Inc.


When demographic variables such as socioe- 
conomic status are taken into account, the size of 
the mean difference reduces to .5 to .7 standard de- 
viations, or 7 to 10 IQ points, but remains robust 
(Reynolds & Brown, 1984a). The actuality of a race 
difference in IQ scores has been replicated so many 
times that it is no longer the focus of serious dis- 
pute. However, the interpretation of the well-doc- 
umented finding is fiercely debated. 

One viewpoint (discussed previously) is that 
the observed IQ disparity is caused, partly or 
wholly, by test bias. This is a popular and widely 
held viewpoint that is rarely supported by technical 
studies of test bias, as noted previously. Test bias 
may play a minor role in race differences, but it 
cannot explain the large and persistent differences 
in IQ scores between African Americans and white 
Americans. Here, we intend to examine a different 
hypothesis, namely: Is the IQ difference between
African Americans and white Americans primarily
genetic? 


The Genetic Hypothesis for 
Race Differences in IQ 


The hypothesis of a genetic basis for race differ- 
ences in IQ first gained scholarly prominence in 
1969 when Arthur Jensen published a provocative 
paper titled “How Much Can We Boost IQ and 
Scholastic Achievement?” (Jensen, 1969). Jensen 
set the tone for his paper in the opening sentence 
when he asserted that “compensatory education has 
been tried and it apparently has failed.” He further 
contended that compensatory education programs 
were based on two fallacious theoretical underpin- 
nings, namely, the “average child concept,” which 
views children as more or less homogeneous, and 
the “social deprivation hypothesis,” which asserts
that environmental deprivation is the primary cause
of lowered achievement and IQ scores. Jensen ar- 
gued forcefully against both suppositions. Further- 
more, leaning heavily upon the literature in 
behavior genetics, Jensen implied that the reason 
whites scored higher than African Americans on IQ 
tests was probably related more to genetic factors 
than to the effects of environmental deprivation. 
The thrust of his paper was to suggest that, since 
compensatory education has proved ineffectual, 
and since the evidence suggests a strong genetic 
component to IQ, therefore it is appropriate to en- 
tertain a genetic explanation for the well-docu- 
mented difference in favor of whites on IQ tests. He 
formulated the genetic hypothesis in a careful, ten- 
tative, scholarly manner: 


The fact that a reasonable hypothesis has not been 
rigorously proved does not mean that it should be 
summarily dismissed. It only means that we need 
more appropriate research for putting it to the test. 
I believe such definitive research is entirely possi- 
ble but has not been done. So all we are left with 
are various lines of evidence, no one of which is 
definitive alone, but which, viewed all together, 
make it a not unreasonable hypothesis that genetic 
factors are strongly implicated in the average 
Negro-white intelligence difference. The prepon- 
derance of the evidence is, in my opinion, less con- 
sistent with a strictly environmental hypothesis 
than with a genetic hypothesis, which, of course, 
does not exclude the influence of environment or 
its interaction with genetic factors. (Jensen, 1969) 


With the articulation of a genetic hypothesis for 
race differences in IQ, Jensen provoked an intense 
debate that has raged on, with periodic lulls, to the
present day. 

In the mid-1990s the controversy over a genetic 
basis for race differences in IQ was intensified once 
again with the publication of The Bell Curve by 
Richard Herrnstein and Charles Murray (1994). 
This massive tome was primarily a book about the 
importance of IQ as a predictor of poverty, school 
leaving, unemployment, illegitimacy, crime, and a 
host of other social pathologies. But two chapters on 
ethnic differences in intelligence caused an uproar 
among social scientists and the lay public. The au-
thors reviewed dozens of studies and concluded that
the IQ gap between African Americans and whites
has changed little in this century. They also argued 
that test bias cannot explain the race differences. 
Furthermore, they noted that races differ not just in 
average IQ scores but also in the profile of intellec- 
tual abilities. In addition, they concluded that intel- 
ligence is only slightly malleable even in the face of 
intensive environmental intervention. As did Jensen, 
Herrnstein and Murray (1994) stated their genetic 
hypothesis with considerable circumspection: 


It seems highly likely to us that both genes and the 
environment have something to do with racial dif- 
ferences [in cognitive ability]. What might the mix 
be? We are resolutely agnostic on that issue; as far 
as we can determine, the evidence does not yet 
justify an estimate. 


Although the authors declined to provide an esti- 
mate of the genetic contribution to race differences 
in IQ, it is clear from the tone of their pessimistic 
book that they believe it to be substantial. Is such a 
disturbing conclusion warranted by the evidence? 


Tenability of the Genetic Hypothesis 


The genetic hypothesis for race IQ differences is an 
unpopular idea that is anathema to many laypersons 
and social scientists. But contempt for an idea does 
not constitute disproof, and superficiality is no sub- 
stitute for a reasoned examination of evidence. In 
light of additional analysis and research, is the ge- 
netic hypothesis for IQ differences tenable? We 
will examine three lines of evidence here which in- 
dicate that the answer is “No.” 

Several critics have pointed out that the genetic 
hypothesis is based on the questionable assumption 
that evidence of IQ heritability within groups can 
be used to infer heritability between racial groups. 
Jensen (1969) expressed this premise rather ex- 
plicitly, pointing to the substantial genetic compo- 
nent in IQ as suggestive evidence that differences 
in IQ between African Americans and white Amer- 
icans are, in part, genetically based. Echoing earlier 
critics, Kaufman (1990) responds as follows:


One cannot infer heritability between groups from
studies that have provided evidence of the IQ’s
heritability within groups. Even if IQ is equally her-
itable within the black and white races separately,
that does not prove that the IQ differences between 
the races are genetic in origin. Scarr-Salapatek’s 
(1971, p. 1226) simple example explains this point 
well: Plant two randomly drawn samples of seeds 
from a genetically heterogeneous population in two 
types of soil—good conditions versus poor condi- 
tions—and compare the heights of the fully grown 
plants. Within each type-of soil, individual varia- 
tions in the heights are genetically determined; but 
the average difference in height between the two 
samples is solely a function of environment. 
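
The logic of the seed example can be made concrete with a small simulation. The Python sketch below is illustrative only: the height units, the size of the soil effects, and the assumption that height is simply a genetic value plus a constant contributed by the soil are ours, not Scarr-Salapatek's.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# One genetically heterogeneous seed population (hypothetical genetic values).
population = [rng.gauss(50.0, 5.0) for _ in range(20_000)]

# Two randomly drawn samples from that same population.
rng.shuffle(population)
seeds_good, seeds_poor = population[:10_000], population[10_000:]

# Assumed model: height = genetic value + a constant set by the soil.
GOOD_SOIL, POOR_SOIL = 10.0, 0.0
heights_good = [g + GOOD_SOIL for g in seeds_good]
heights_poor = [g + POOR_SOIL for g in seeds_poor]

# Within each soil, height differences track genetic differences perfectly.
r_within_good = pearson(seeds_good, heights_good)
r_within_poor = pearson(seeds_poor, heights_poor)

# Between soils, the average gap reflects the soil, not the seeds.
height_gap = mean(heights_good) - mean(heights_poor)
genetic_gap = mean(seeds_good) - mean(seeds_poor)

print(f"within-soil correlations: {r_within_good:.3f}, {r_within_poor:.3f}")
print(f"mean height gap: {height_gap:.2f}; mean genetic gap: {genetic_gap:.2f}")
```

Under these assumptions the trait is fully heritable within each sample, yet the difference between the sample means is almost entirely the soil effect, with only sampling error separating the genetic means.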


Another criticism of the genetic hypothesis is 
that careful analysis of environmental factors pro- 
vides a sufficient explanation of race differences in 
IQ; that is, the genetic hypothesis is simply unnec- 
essary. This is the approach taken by Brooks-Gunn, 
Klebanov, and Duncan (1996) in a study of 483 
African American and white low-birthweight chil- 
dren. What makes their study different from other 
similar analyses is the richness of their data. Instead
of using only one or two measures of the environ-
ment (e.g., a single index of poverty level), they
collected longitudinal data on income level and 
many other cofactors of poverty such as length of 
hospital stay, maternal verbal ability, home learn- 
ing environment, neighborhood condition, and 
other components of family social class. When the 
children’s IQs were tested at age 5 with the WPPSI, 
the researchers found the usual disparity between 
the white children (mean IQ of 103) and the African 
American children (mean IQ of 85). However, 
when poverty and its cofactors were statistically 
controlled, the IQ differences were almost com- 
pletely eliminated. Their study suggests that previ- 
ous research has underestimated the pervasive 
effects of poverty and its cofactors as a contribu- 
tion to African American and white IQ differences. 

A third criticism of the genetic hypothesis is 
that race as a biological entity is simply nonexis- 
tent; that is, there are no biological races. Fish 
(2002) and other proponents of this viewpoint 
argue that “race” is a socially constructed concept, 
not a biological reality: 

Homo sapiens has no extant subspecies: There are
no biological races. Human physical appearance
varies gradually around the planet, with the most
geographically distant peoples generally appearing
the most different from one another. The concept of
human biological races is a construction socially
and historically localized to 17th and 18th-century
European thought. Over time, different cultures
have developed different sets (folk taxonomies) of
socially defined “races.” (p. 29)


Put another way, racial categories are social con- 
structions based upon superficial physical differ- 
ences (especially skin color) that serve cultural- 
psychological objectives (e.g., reducing uncertainty 
about how we should respond to one another). 
However, racial categories do not signify meaning-
ful biological differences. A biologist expresses the
point this way: “All of humanity shares in common 
the vast majority of its molecular genetic variation 
and the adaptive traits that define us as a single 
species” (Templeton, 2002, p. 51). Thus, insofar as 
race has no biological reality, the argument that 
“race” differences in IQ originate from a genetic 
basis is not only pernicious, it is also absurd. 
Neisser, Boodoo, Bouchard, and others (1996) 
offer additional perspectives on race differences in 
IQ and related topics. 

Before leaving the topic of race differences in 
IQ, we should point out that the emotion attached 
to this topic is largely undeserved, for two reasons. 
First, racial groups always show large overlaps in
IQ, meaning that the peoples of the earth are much
more alike than they are different. Second, as pre- 
viously noted, the existing race differences in IQ 
certainly reflect cultural differences and environ- 
mental factors to a substantial degree. Wilson 
(1994) has catalogued the numerous differences in 
cultural background between African Americans 
and white Americans. In 1992, for example, 64 per- 
cent of African American parents were divorced, 
separated, widowed, or never married; 63 percent 
of African American births were to unmarried 
mothers; and 30 percent of African American births 
were to adolescents (U.S. Bureau of the Census, 
1993). On average, these realities of family life for 
many African Americans inevitably will lead to 
lowered performance on intelligence tests. Lest the 
reader conclude that we are hereby endorsing a 
subtle form of Anglocentric superiority, consider
Lynn’s (1987) conclusion that the mean IQ of the
Japanese is 107, a full 7 points higher than the av- 
erage for American whites. So what? 


AGE CHANGES IN INTELLIGENCE


We turn now to another controversial topic— 
whether intelligence declines with age. Certainly, 
one of the most pervasive stereotypes about aging 
is that we lose intellectual ability as we grow older. 
This stereotype is so pervasive that few laypersons 
question it. But we should question it. 

In general, the empirical study of this topic pro- 
vides a more optimistic conclusion than the com- 
mon stereotype suggests. However, the research 
also reveals that age changes in intelligence are 
complex and multifaceted. The simple question, 
“Does intelligence decline with age?” turns out to 
have several labyrinthine answers. 

We can trace the evolution of research on age- 
related intellectual changes as follows: 


1. Early cross-sectional research with instruments 
such as the WAIS painted a somber picture of a 
slow decline in general intelligence after age 15 
or 20 and a precipitously accelerated descent 
after age 60. 

2. Just a few years later, more sophisticated stud- 
ies using sequential testing with multidimen- 
sional instruments such as the Primary Mental 
Abilities Test suggested a more optimistic tra- 
jectory for intelligence: minimal change in most 
abilities until at least age 60. 

3. Parallel research utilizing the fluid/crystallized 
distinction posited a gradual increase in crystal- 
lized intelligence virtually to the end of life, 
juxtaposed against a rapid decline in fluid 
intelligence. 

4. Most recently, a few psychologists have pro- 
posed that adult intelligence is qualitatively dif- 
ferent, akin to a new Piagetian stage that might 
be called postformal reasoning. This research 
calls into question the ecological validity of using 
standard instruments with older examinees. 


We examine each of these research epochs in more 
detail in the following sections. 


Early Cross-Sectional Research 


One of the earliest comprehensive studies of age 
trends on an individually administered intelligence 
test was reported by Wechsler (1944) shortly after 
publication of the Wechsler-Bellevue Form I. As is 
true of all the Wechsler tests designed for adults, 
raw scores on the W-B I subtests were first trans- 
formed into standard scores (referred to as scaled 
scores) with a mean of 10 and an SD of 3. Regard- 
less of the age of the subject, these scaled scores 
were based on a fixed reference group of 350 sub- 
jects ages 20 to 34 included in the standardization 
sample. By consulting the appropriate age table, the 
sum of the 11 scaled scores was then used to find 
an examinee’s IQ. 
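
The conversion to scaled scores can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple linear transformation against the reference group; the function name and the raw scores are hypothetical, and an operational test would rely on its published conversion tables instead.

```python
from statistics import mean, pstdev


def scaled_score(raw, reference_raw_scores):
    """Express a raw subtest score on a scale with mean 10 and SD 3,
    relative to a fixed reference group (assumed linear conversion)."""
    ref_mean = mean(reference_raw_scores)
    ref_sd = pstdev(reference_raw_scores)  # population SD of the reference group
    return 10 + 3 * (raw - ref_mean) / ref_sd


# Hypothetical raw scores for a very small reference group.
reference = [12, 15, 18, 21, 24]

# A raw score at the reference mean maps to 10; one SD above maps to 13.
at_mean = scaled_score(18, reference)
one_sd_above = scaled_score(18 + pstdev(reference), reference)
print(at_mean, round(one_sd_above, 2))
```

Because the reference group is fixed, the same raw score yields the same scaled score at any age, which is what allows scaled scores to be compared across age groups.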

However, the sum of the scaled scores by itself 
is a direct index of an examinee’s ability relative to 
the reference group. Wechsler used this index to 
chart the relationship between age and intelligence 
(Figure 7.10). His results indicated a rapid growth 
in general intelligence in childhood through age 15 
or 20, followed by a slow decline to age 65. He was 
characteristically blunt in discussing his findings: 


If the fact that intellectual growth stops at about the 
age of fifteen has been a hard fact to accept, the indi- 
cation that intelligence after attaining its maximum 
forthwith begins to decline just as any other physio- 
logical capacity, instead of maintaining itself at its 
highest level over a long period of time, has been
an even more bitter pill to swallow. It has, in fact,
proved so unpalatable that psychologists have gener- 
ally chosen to avoid noticing it. (Wechsler, 1952) 


Normative studies with subsequent Wechsler 
adult tests revealed exactly the same pattern
for the WAIS (Wechsler, 1955), the WAIS-R 
(Wechsler, 1981), and the WAIS-III (Tulsky, Zhu, 
& Ledbetter, 1997). Later investigators also ex- 
tended the older age limit tested to age 70 and 80, 
finding a progressive and accelerating rate of de- 
cline in overall test performance that seemed to 
confirm Wechsler’s belief that it was “normal” for 
general intelligence to decline after young adult- 
hood (Eisdorfer & Cohen, 1961). 

FIGURE 7.10 The Curve of Growth and Supposed Decline on the Wechsler-Bellevue Form I
Source: From Wechsler, D. (1944). Measurement of adult intelligence (3rd ed.). Baltimore: Williams &
Wilkins. Reprinted with permission of Oxford University Press.


Overlooked by Wechsler and many other cross-sectional
design researchers was the influence of their methodology
upon their findings. It has been recognized for quite some
time that cross-sectional studies often confound age effects
with educational disparities or other age-group differences
(see Baltes, Reese, & Nesselroade, 1977; Kausler, 1991). For
example, in the normative studies of the Wechsler tests, it
is invariably true that the younger standardization subjects
are better educated than the older ones. In all likelihood,
the lower scores of the older subjects are caused, in part,
by these educational differences rather than signifying an
inexorable age-related decline.


Sequential Studies of Intelligence 


To control for age-group differences, many re- 
searchers prefer a longitudinal design in which the 
same subjects are retested one or more times over 
periods of 5 to 10 years and, in rare cases, up to 40 
years later. Because there is only one group of sub- 
jects, longitudinal designs eliminate age-group dis- 
parities (e.g., more education in the young than the 
old subjects) as a confounding factor. However, the 
longitudinal approach is not without its shortcom- 
ings. Longitudinal studies suffer from four poten- 
tial pitfalls: 


1. Time of measurement is the most serious prob- 
lem. Major historical events such as an economic 
depression can warp the intellectual and psycho- 
logical development of entire generations. As a 
result, longitudinally measured age changes may 
reflect the peculiarities of the time of measure- 
ment rather than any universal age effects. 

2. Selective attrition—the less capable subjects 
may be the most likely to drop out, artificially 
inflating the mean scores of retested subjects. 

3. Practice effects—examinees improve when they 
take the same test two, three, even five times. 

4. Regression to the mean—especially a problem 
when participants are selected because of their 
initial extreme scores such as very low IQ scores 
(Hayslip & Panek, 1989). 
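
The fourth pitfall, regression to the mean, is easy to demonstrate by simulation. The following Python sketch is hypothetical throughout: the ability distribution, the size of the measurement error, and the cutoff of 80 are assumptions chosen for illustration. No one's underlying ability changes between the two testings.

```python
import random
from statistics import mean

rng = random.Random(42)
N = 10_000

# Stable underlying ability (assumed mean 100, SD 12).
true_ability = [rng.gauss(100, 12) for _ in range(N)]

# Each observed score = stable ability + independent measurement error.
test1 = [t + rng.gauss(0, 9) for t in true_ability]
test2 = [t + rng.gauss(0, 9) for t in true_ability]

# Select participants because of their extreme low scores at first testing.
CUTOFF = 80
selected = [i for i in range(N) if test1[i] < CUTOFF]

mean_first = mean(test1[i] for i in selected)
mean_retest = mean(test2[i] for i in selected)

print(f"selected {len(selected)} low scorers")
print(f"mean at first testing: {mean_first:.1f}")
print(f"mean at retest:        {mean_retest:.1f}")
```

The selected group scores higher at retest even though nothing about the participants has changed, so an apparent "gain" in a group chosen for extreme scores cannot be read as a true age change.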


The most efficient research method for study- 
ing age changes in ability is a cross-sequential 
design that combines cross-sectional and longitu- 
dinal methodologies (Schaie, 1977): 


In brief, the researchers begin with a cross-sectional
study. Then, after a period of years, they retest these
subjects, which provides longitudinal data on several
cohorts—a longitudinal sequence. At the same
time, they test a new group of subjects, forming a
second cross-sectional study—and, together with 
the first cross-sectional study, a cross-sectional se- 
quence. This whole process can be repeated over 
and over (every five or ten years, say) with retesting 
of old subjects (adding to the longitudinal data) and 
first-testing of new subjects (adding to the cross- 
sectional data). (Schaie & Willis, 1986) 


In 1956, Schaie began the most comprehensive 
cross-sequential study ever conducted in what is re- 
ferred to as the Seattle Longitudinal Study (Schaie, 
1958, 1996). He administered Thurstone’s test of 
five primary mental abilities (PMAs) and other 
intelligence-related measures to an initial cross- 
sectional sample of 500 community-dwelling adults. 
The PMA Test subtests include Verbal Meaning, 
Space, Reasoning, Number, and Word Fluency. In 
1963, he retested these subjects and added a new 
cross-sectional cohort. Additional waves of data 
were collected in 1970, 1977, 1984, and 1991.
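
The bookkeeping of a cross-sequential design can be laid out as a simple schedule. The Python sketch below uses the testing years named above but makes two simplifying assumptions that the text does not confirm: that a new sample joins at every wave, and that no sample is lost to attrition. The function names are hypothetical.

```python
# Testing years named in the text for the Seattle Longitudinal Study.
WAVES = [1956, 1963, 1970, 1977, 1984, 1991]


def build_schedule(waves):
    """Map each sample (labelled by the year it was first tested) to the
    waves at which it is tested, assuming a new sample joins at every
    wave and no sample is ever dropped."""
    return {first: [w for w in waves if w >= first] for first in waves}


def samples_tested_in(schedule, year):
    """All samples that sit for testing in a given wave."""
    return [first for first, years in schedule.items() if year in years]


schedule = build_schedule(WAVES)

# Longitudinal sequence: one sample followed across successive waves.
longitudinal_1956 = schedule[1956]

# Cross-sectional sequence: the newly recruited sample at each wave.
new_samples = list(schedule)

for first, years in schedule.items():
    print(first, "->", years)
```

Reading the schedule by rows gives the longitudinal sequences; reading only the first entry of each row gives the cross-sectional sequence.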

Three conclusions emerged from Schaie’s 
cross-sequential study of adult mental abilities: 


1. Each cross-sectional study indicated some
degree of apparent age-related decrement in
mental abilities, postponed until after age 50
for some abilities, but beginning after age 35
for others. In particular, Number skills and 
Word Fluency showed an age-related decrement 
only after age 50, whereas Verbal Meaning, 
Space, and Reasoning scores appeared to de- 
cline sooner, after age 35. 

2. Successive cross-sectional studies—the cross- 
sectional sequence—revealed significant inter- 
generational differences in favor of those born 
most recently. Even holding age constant, those 
born and tested most recently performed better 
than those born and tested at an earlier time. For 
example, 30-year-old examinees tested in 1977 
tended to score better than 30-year-old exami- 
nees tested in 1970, who tended to score better 
than 30-year-old examinees tested in 1963, who, 
in turn, outperformed 30-year-old examinees 
tested in 1956. However, these cohort differ- 
ences in intelligence were not uniform across 
the different abilities measured by the PMA 
Test. The pattern of rising abilities was most 
apparent for Verbal Meaning, Reasoning, and
Space. Cohort changes for Number and Word
Fluency were more complex and contradictory.

3. In contrast to the moderately pessimistic find- 
ings of the cross-sectional comparisons, the lon- 
gitudinal comparisons showed a tendency for 
mean scores either to rise slightly or to remain 
constant until approximately age 60 or 70. The 
only exceptions to this trend involved highly 
speeded tests such as Word Fluency, in which 
the examinee must name words in a given cate- 
gory as quickly as possible, and Number, in 
which the examinee must complete arithmetic 
computations quickly and accurately. 


The results of the Schaie study are even more 
optimistic when individual longitudinal findings 
are disentangled from the group averages. As pre- 
viously noted, the longitudinal findings differed 
from one mental ability to another. Nonetheless, 
taking the average of the five PMAs and using the 
25th percentile for 25-year-olds as his standard of 
meaningful decline, Schaie has shown that no more 
than 25 percent of those studied had declined by 
age 67. From age 67 to age 74 about a third of the 
subjects had declined, whereas from age 74 to age 
81, slightly more than 40 percent had declined 
(Schaie, 1980, 1996; Schaie & Willis, 1986). In 
sum, the vast majority of us show no meaningful 
decline in the skills measured by the Primary Men- 
tal Abilities Test until we are well into our seven- 
ties. Perhaps even more impressive is the fact that 
approximately 10 percent of the sample improved 
significantly when retested in their seventies and 
eighties. Based on his research and other longitu- 
dinal studies, Schaie arrives at this conclusion: 


If you keep your health and engage your mind with 
the problems and activities of the world around 
you, chances are good that you will experience lit- 
tle if any decline in intellectual performance in 
your lifetime. That’s the promise of research in the 
area of adult intelligence. (Schaie & Willis, 1986) 


We concur with this resolution, but it would be un- 
fair to leave the impression that all authorities in 
this area agree. Horn and Cattell have been the 
most vocal skeptics, arguing for a significant age- 
related decrement in fluid intelligence because of 
its reliance upon neural integrity, which is
presumed to decline with advancing age (Horn &
Cattell, 1966; Horn, 1985). Cross-sectional studies
certainly support this view. For example, Wang 
and Kaufman (1993) plotted age differences in vo- 
cabulary and matrices scores from the Kaufman 
Brief Intelligence Test and found little change in 
vocabulary (crystallized measure) but a sharp drop 
in matrices (fluid measure). These results held true 
even when the scores were adjusted for educa- 
tional level (Figure 7.11). Of course, cross-sec- 
tional studies are open to rival interpretations and 
can therefore only suggest longitudinal patterns. 
Readers who wish to pursue this controversy 
should consult Kausler (1991) and Lindenberger 
and Baltes (1994). 


FIGURE 7.11 Mean Adjusted Standard Scores on the K-BIT as a Function of Age
Source: Reprinted with permission from Wang, J., & Kaufman, A. (1993). Changes in fluid and
crystallized intelligence across the 20- to 90-year age range on the K-BIT. Journal of
Psychoeducational Assessment, 11, 29-37.


GENERATIONAL CHANGES IN INTELLIGENCE TEST SCORES


What happens to the intelligence of a population
from one generation to the next? For example,
how does the intelligence of Americans in the
1990s compare to the intelligence of their forebears
in the early 1900s? We might expect that any
differences would be small. After all, the human
gene pool has remained essentially constant for
centuries, perhaps millennia. Furthermore, only a
small fraction of any generation is exposed to the
extremes of environmental enrichment or deprivation
that might stunt or boost intelligence dramatically.
Thus, common sense dictates that any
generational changes in population intelligence
would be minimal.

On this issue, common sense appears to be in- 
correct. Flynn (1984, 1987, 1994) has charted the 
standardization data from the successive editions of 
the Stanford-Binet and the Wechsler tests from 1932 
to 1981 and found that each edition established a 
higher standard than its predecessor (Table 7.5). The 
total gain amounts to an apparent rise in mean IQ of 
13.8 points. Thus, a person who earned a Full Scale 
IQ of 100 in 1990 would have earned, on average, 
an IQ of 114 in 1932! A similar rising performance 
has been observed in many other industrialized 
nations using such tests as Raven’s Progressive 
Matrices (Flynn, 1987). 
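
The size of these gains is easier to appreciate as a yearly rate. The Python sketch below copies the standardization dates and apparent gains for three comparisons from Table 7.5 and divides each gain by the years between standardizations. The function names are hypothetical, and the final calculation assumes, purely for illustration, a constant rate of 0.3 points per year applied to norms that are 20 years old.

```python
# (comparison, standardization date of test 1, of test 2, apparent IQ gain)
# Values copied from three rows of Table 7.5.
COMPARISONS = [
    ("SB-L & WISC", 1932.0, 1947.5, 5.49),
    ("WISC & WISC-R", 1947.5, 1972.0, 8.41),
    ("WAIS & WAIS-R", 1953.5, 1978.0, 8.04),
]


def gain_per_year(date_1, date_2, gain):
    """Apparent IQ gain divided by the years between standardizations."""
    return gain / (date_2 - date_1)


def expected_inflation(rate_per_year, years_since_norming):
    """Points by which obsolete norms would overstate IQ, assuming the
    gain continues at a constant rate (a simplifying assumption)."""
    return rate_per_year * years_since_norming


rates = {name: gain_per_year(d1, d2, g) for name, d1, d2, g in COMPARISONS}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} IQ points per year")

# Hypothetical case: 20-year-old norms at an assumed 0.3 points per year.
print(f"{expected_inflation(0.3, 20):.1f} points of inflation")
```

All three comparisons work out to roughly a third of an IQ point per year, which is why norms that are even a decade or two old can noticeably overstate an examinee's standing.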

TABLE 7.5 Comparative Average IQs from Successive Editions of the Stanford-Binet and Wechsler Tests

                              Dates                                 Means
Test Combinations          Test 1  Test 2  Years    Ages      N  Test 1  Test 2   Gain
 1. SB-L & WISC              1932  1947.5   15.5    5-15  1,563  107.13  101.64   5.49
 2. SB-M & WISC              1932  1947.5   15.5       5     46  125.13  107.56  17.57
 3. SB-LM & WISC             1932  1947.5   15.5    5-15    460  114.64  109.67   4.97
 4. SB-L & WAIS              1932  1953.5   21.5   16-32    271  113.02  105.48   7.54
 5. SB-LM & WAIS             1932  1953.5   21.5   16-48     79  109.08  101.75   7.33
 6. SB-LM & WPPSI            1932  1964.5   32.5     4-6    416  101.74   92.78   8.96
 7. SB-LM & SB-72            1932  1971.5   39.5    2-18  2,351  107.08   97.19   9.89
 8. WB-I & WISC            1936.5  1947.5     11   11-14    110  103.51  105.54  -2.03
 9. WB-I & WAIS            1936.5  1953.5     17   16-39    152  122.94  118.25   4.69
10. WISC & WAIS            1947.5  1953.5      6   14-17    436  101.76   99.12   2.64
11. WISC & WPPSI           1947.5  1964.5     17     5-6    108   93.56   90.86   2.70
12. WISC & SB-72           1947.5  1971.5     24    6-10     30   96.40   84.42  11.98
13. WISC & WISC-R          1947.5    1972   24.5    6-15  1,042   97.19   88.78   8.41
14. WAIS & WISC-R          1953.5    1972   18.5   16-17     40  102.94   96.29   6.65
15. WAIS & WAIS-R          1953.5    1978   24.5   35-44     72  109.69  101.65   8.04
16. WPPSI & SB-72          1964.5  1971.5      7     4-5     35   93.06   88.65   4.41
17. WPPSI & WISC-R         1964.5    1972    7.5     5-6    140  112.84  108.58   4.26
18. WISC-R & WAIS-R          1972    1978      6      16     80   99.61   98.65   0.96
19. WISC-R & WISC-III        1972    1988     16    6-16    206  108.20  102.90   5.30
20. WAIS-R & WISC-III        1978    1988     10      16    189  105.30  101.40   3.90
21. WPPSI-R & WISC-III       1986    1988      2       6    188  106.50  102.50   4.00
22. WISC-III & WAIS-III      1988    1996      8      16    184  104.60  103.90   0.70

Note: Dates refers to dates of standardization for the comparison tests. Years refers to the years between standardization. Ages refers to the
age range (in years) of tested subjects. Gain refers to the apparent gain in IQ.

Source: Adapted with permission from Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological
Bulletin, 95, 29-51.


Apparently, citizens of the Western industrialized
nations have become better educated and more
literate in this century, causing mean IQ levels to
rise quite sharply. However, IQ gains of this mag- 
nitude pose a serious problem of causal explana- 
tion. Flynn (1994) is skeptical that any real and 
meaningful intelligence of a population could vault 
upward so quickly. He concludes that current tests 
do not measure intelligence but rather a correlate 
with a weak causal link to intelligence: 


Psychologists should stop saying that IQ tests mea- 
sure intelligence. They should say that IQ tests 
measure abstract problem-solving ability (APSA), 
a term that accurately conveys our ignorance. We 
know people solve problems on IQ tests; we sus- 
pect those problems are so detached, or so ab- 


stracted from reality, that the ability to solve them 
can diverge over time from the real-world problem- 
solving ability called intelligence; thus far we 
know little else. (Flynn, 1987) 


Although Flynn’s radical prescription is not widely 
endorsed by experts in the field, he has sensitized 
psychometricians to the dangers of rendering con- 
clusions based on ever-shifting intelligence test 
norms. IQ gains over time make it imperative to re- 
standardize tests frequently, otherwise examinees 
are being scored with obsolete norms and will re- 
ceive inflated IQ scores (Flynn, 1994). Neisser 
(1998) has edited a book that explores possible
explanations of the rising curve of IQ scores.


SUMMARY


1. Test bias is defined as differential validity 
of a given interpretation of a test score for any de- 
finable, relevant subgroup of test takers. As an ex- 
ample of test bias, an oral arithmetic test delivered 
in English might function well for English-speak- 
ing children, whereas for Hispanic children, the 
same test might predict arithmetic skill poorly. 


2. Bias in content validity is demonstrated 
when an item or subscale of a test is relatively more 
difficult for members of one group than another 
after general ability level is held constant. In gen- 
eral, for the major standardized tests of ability and 
aptitude, evidence of bias in content validity is 
scant or nonexistent. 


3. Bias in predictive or criterion-related va- 
lidity is demonstrated when a test does not predict 
a relevant criterion equally well for persons from 
different subpopulations. An unbiased test pos- 
sesses homogeneous regression: The results for all 
relevant subpopulations cluster equally well around 
a single regression line. 


4. Bias in construct validity is demonstrated 
when a test is shown to measure different traits or 
constructs for one group than another. In compar- 
isons across relevant subpopulations, a nonbiased 
test will reveal a high degree of similarity for the 
factorial structure of the test and the rank order of 
item difficulties within the test. 


5. Test fairness incorporates social values and 
philosophies of test use. Three philosophies have 
been outlined: unqualified individualism (select the 
best person using all predictors), quotas (select by 
ratios), and qualified individualism (select the best 
person not using race, sex, etc., as predictors). 
Which of these philosophies is correct? This is an 
ethical question that is beyond objective solution. 


6. The genetic contribution to human charac- 
teristics is usually measured in terms of a heri- 
tability index that can vary from 0.0 to 1.0. 
Heritability is an estimate of how much of the total 
variance in a given trait is caused by genetic factors. 
Heritability is relative to the population sampled
and does not explain individual scores. For IQ,
most estimates of heritability are around .50.


7. Evidence of a genetic contribution to intelligence
is documented by the Minnesota twin studies
in which identical twins separated at birth are
reunited for extensive psychometric testing. Even 
though many twin pairs were reared in dissimilar 
environments, their adult IQs are remarkably sim- 
ilar. These findings corroborate earlier adoption 
studies. 


8. Jensen (1977) studied the cumulative ef- 
fects of a bleak, deprived environment on the IQs 
of rural African American children. He found that 
older African American children scored lower than 
their younger brothers and sisters, apparently los- 
ing about one IQ point a year, on average, between 
the ages of 6 and 16. 


9. Scarr and Weinberg studied the effects 
of environmental enrichment: They found that
African American children adopted into upper- 
middle-class white families showed above-average 
IQs. 

10. Heavy drinking by pregnant women causes 
their offspring to be at very high risk for fetal alco- 
hol syndrome, characterized by facial abnormali- 
ties, growth deficiencies, motor problems, 
hyperactivity, and mild to moderate mental retar- 
dation. With lower levels of drinking, offspring 
may show attentional impairment and other subtle 
problems known as fetal alcohol effect. 


11. Environmental toxins may also affect in- 
telligence. For example, children who absorb 
undue amounts of lead (such as by eating chips of
lead-pigmented paint) may evidence long-term
decrements in mental functioning (lowered IQ, problems
with auditory and speech processing, and slowed 
reaction time). 


12. On the average, African Americans score 
about 15 points lower than white Americans on 
standardized IQ tests. When demographic variables 
such as social class are accounted for, the difference
reduces to 7 to 10 IQ points. Apparently, the
magnitude of the difference has remained constant
through the mid- to late twentieth and early twenty-first
centuries.


13. Jensen (1969) and others have proposed 
that whites score higher than African Americans on 
IQ tests partly because of genetic factors. This hy- 
pothesis is based on the questionable assumption 
that evidence of IQ heritability within groups can 
be used to infer heritability between racial groups. 
Research on racial admixture and IQ does not sup- 
port a genetic view. 


14. Longitudinal research on age and intelligence
provides a more optimistic perspective than
does cross-sectional research. In longitudinal studies,
most abilities change little until at least age 60.
Fluid abilities—largely nonverbal and culture- 
reduced mental efficiency—show a greater age de- 
cline than other abilities. 


15. Flynn has charted the standardization data 
for each edition of the Stanford-Binet and the 
Wechsler scales from 1932 to the present time. 
Each test established a higher standard than its pre- 
decessor, with a total gain of about 14 IQ points. 
These apparent IQ gains pose serious problems of 
explanation and indicate that norms for tests may 
shift very rapidly.


In this chapter, we examine a variety of instruments
traditionally grouped under the headings
of aptitude tests and achievement tests. The cover- 
age includes relevant instruments, but also em- 
braces issues and applications in aptitude and 
achievement testing. In Topic 8A, Aptitude Tests 
and Factor Analysis, the use of factor analysis in 
the development of aptitude measures is described. 
This is followed by a review of typical instruments, 
including multiple aptitude test batteries and tests 
used to predict academic performance in college. 
In Topic 8B, Group Tests of Achievement, we 
examine the educational achievement test batteries 
familiar to every student of American schooling. In 
addition, the reader will encounter a brief discus- 
sion of special-purpose tests for achievement as 
well as a review of troubling social issues that
pertain to school system cheating on achievement
tests.


Aptitude Tests and Factor Analysis


Here we focus on aptitude tests, especially the 
multiple aptitude batteries commonly used to pre- 
dict performance in school, employment, and mil- 
itary settings. Typically, multiple aptitude batteries 
perform a gatekeeper function. School admission, 
corporate employment, and military entry may 
hinge upon findings from the tests discussed here. 
Aptitude tests command great respect and therefore 
possess immense influence in modern society. The 
validity of aptitude tests is indeed consequential. 
The reader will learn more about the application of 
aptitude tests later in this topic. 

Many aptitude tests arose as specialized offshoots
of ability tests shortly after psychologists developed
the necessary statistical tools for portioning
general intelligence into its subcomponents. Put
simply, most aptitude tests owe their origin to fac- 
tor analysis, a family of procedures that researchers 
use to summarize relationships among variables 
that are correlated in highly complex ways. Because 
aptitude tests could not flourish without factor 
analysis, we begin this section with a primer of this 
useful statistical technique. The topic then contin- 
ues with a discussion of prominent tests of aptitude, 
including multi-aptitude batteries useful for em- 
ployment counseling (Differential Aptitude Test, 
General Aptitude Test Battery, and Armed Services 
Vocational Assessment Battery), tests used for col- 
lege admission (Scholastic Assessment Tests and 
American College Test), and postgraduate admis- 
sion tests (Graduate Record Exam, Medical College 
Admission Test, and Law School Admission Test). 


A PRIMER OF FACTOR ANALYSIS


Broadly speaking, there are two forms of factor 
analysis: confirmatory and exploratory. In confir- 
matory factor analysis, the purpose is to confirm 
that test scores and variables fit a certain pattern 
predicted by a theory. For example, if the theory un- 
derlying a certain intelligence test prescribed that 
the subtests belong to three factors (e.g., verbal, 
performance, and attention factors), then a confir- 
matory factor analysis could be undertaken to eval- 
uate the accuracy of this prediction. Confirmatory 
factor analysis is essential to the validation of many 
ability tests. 

The central purpose of exploratory factor 
analysis is to summarize the interrelationships 
among a large number of variables in a concise and 
accurate manner as an aid in conceptualization 
(Gorsuch, 1983). For instance, factor analysis may 
help a researcher discover that a battery of 20 tests 
represents only four underlying variables, called
factors. The smaller set of derived factors can be 
used to represent the essential constructs that un- 
derlie the complete group of variables. 

Perhaps a simple analogy will clarify the nature
of factors and their relationship to the variables or 
tests from which they are derived. Consider the 
track-and-field decathlon, a mixture of 10 diverse
events including sprints, hurdles, pole vault, shot
put, and distance races, among others. In concep- 
tualizing the capability of the individual decathlete, 
we do not think exclusively in terms of the partici- 
pant’s skill in specific events. Instead, we think in 
terms of more basic attributes such as speed, 
strength, coordination, and endurance, each of 
which is reflected to a different extent in the indi- 
vidual events. For example, the pole vault requires 
speed and coordination, while hurdle events de- 
mand coordination and endurance. These inferred 
attributes are analogous to the underlying factors of 
factor analysis. Just as the results from the 10 
events of a decathlon may boil down to a small 
number of underlying factors (e.g., speed, strength, 
coordination, and endurance), so too may the re- 
sults from a battery of 10 or 20 ability tests reflect 
the operation of a small number of basic cognitive 
attributes (e.g., verbal skill, visualization, calcula- 
tion, and attention, to cite a hypothetical list). This 
example illustrates the goal of factor analysis: to 
help produce a parsimonious description of large, 
complex data sets. 

We will illustrate the essential concepts of fac- 
tor analysis by pursuing a classic example con- 
cerned with the number and kind of factors that best 
describe student abilities. Holzinger and Swineford 
(1939) gave 24 ability-related psychological tests 
to 145 junior high school students from Forest
Park, Illinois. The factor analysis described later 
was based upon methods outlined in Kinnear and 
Gray (1997). 

It should be intuitively obvious to the reader 
that any large battery of ability tests will reflect a 
smaller number of basic, underlying abilities (fac- 
tors). Consider the 24 tests depicted in Table 8.1. 
Surely some of these tests measure common un- 
derlying abilities. For example, we would expect 
Sentence Completion, Word Classification, and 
Word Meaning (variables 7, 8, and 9) to assess a 
factor of general language ability of some kind. In 
like manner, other groups of tests seem likely to 
measure common underlying abilities. But how 
many abilities or factors? And what is the nature of 
these underlying abilities? Factor analysis is the 
ideal tool for answering these questions. We follow 
TABLE 8.1 The 24 Ability Tests Used by Holzinger and Swineford (1939)

 1. Visual Perception              13. Straight and Curved Capitals
 2. Cubes                          14. Word Recognition
 3. Paper Form Board               15. Number Recognition
 4. Flags                          16. Figure Recognition
 5. General Information            17. Object-Number
 6. Paragraph Comprehension        18. Number-Figure
 7. Sentence Completion            19. Figure-Word
 8. Word Classification            20. Deduction
 9. Word Meaning                   21. Numerical Puzzles
10. Add Digits                     22. Problem Reasoning
11. Code (Perceptual Speed)        23. Series Completion
12. Count Groups of Dots           24. Arithmetic Problems





the factor analysis of the Holzinger and Swineford 
(1939) data from beginning to end. 


The Correlation Matrix 


The beginning point for every factor analysis is the 
correlation matrix, a complete table of intercor-
relations among all the variables.¹ The correla-
tions between the 24 ability variables discussed
here can be found in Table 8.2. The reader will no- 
tice that variables 7, 8, and 9 do, indeed, intercor- 
relate quite strongly (correlations of .62, .69, and 
.53), as we suspected earlier. This pattern of inter- 
correlations is presumptive evidence that these 
variables measure something in common; that is, it 
appears that these tests reflect a common underly- 
ing factor. However, this kind of intuitive factor 
analysis based upon a visual inspection of the cor- 
relation matrix is hopelessly limited; there are just 
too many intercorrelations for the viewer to discern 
the underlying patterns for all the variables. Here is 
where factor analysis can be helpful. Although we 
cannot elucidate the mechanics of the procedure, 


1. In this example, the variables are tests that produce more or 
less continuous scores. But the variables in a factor analysis can 
take other forms, so long as they can be expressed as continu- 
ous scores. For example, all of the following could be variables 
in a factor analysis: height, weight, income, social class, and 
rating-scale results. 


factor analysis relies upon modern high-speed 
computers to search the correlation matrix accord- 
ing to objective statistical rules and determine the 
smallest number of factors needed to account 
for the observed pattern of intercorrelations. The 
analysis also produces the factor matrix, a table 
showing the extent to which each test loads on (cor- 
relates with) each of the derived factors, as dis- 
cussed in the following section. 
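
To make the starting point concrete, the following minimal sketch builds a small correlation matrix from raw scores. The three test names and the six examinees' scores are hypothetical and serve only to show the computation; they are not drawn from the Holzinger and Swineford data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Map every pair of test names to the correlation of their scores."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical scores for six examinees on three made-up tests.
scores = {
    "vocabulary": [12, 15, 9, 20, 17, 11],
    "sentences": [10, 14, 8, 19, 15, 12],
    "addition": [30, 22, 27, 25, 31, 24],
}

matrix = correlation_matrix(scores)
for name in scores:
    row = " ".join(f"{matrix[name][other]:5.2f}" for other in scores)
    print(name.ljust(10), row)
```

In this made-up data the two verbal tests correlate strongly with each other, which is the kind of pattern that visual inspection can catch in a small matrix but not in a matrix of 24 variables.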


The Factor Matrix and Factor Loadings 


The factor matrix consists of a table of correla- 
tions called factor loadings. The factor loadings 
(which can take on values from -1.00 to +1.00) in-
dicate the weighting of each variable on each fac-
tor. For example, the factor matrix in Table 8.3
shows that five factors (labeled I, II, III, IV, and V)
were derived from the analysis. Note that the first 
variable, Series Completion, has a strong positive 
loading of .71 on factor I, indicating that this test is 
a reasonably good index of factor I. Note also that 
Series Completion has a modest negative loading of 
-.11 on factor II, indicating that, to a slight extent,
it measures the opposite of this factor; that is, high 
scores on Series Completion tend to signify low 
scores on factor II, and vice versa. 

The factors may seem quite mysterious, but in 
reality they are conceptually quite simple. A factor 
TABLE 8.2 The Correlation Matrix for 24 Ability Variables





Note: Decimals omitted. 


Source: Reprinted with permission from Holzinger, K., & Harman, H. (1941). Factor analysis: A synthesis of factorial methods. 
Chicago: University of Chicago Press. Copyright © 1941 The University of Chicago Press. 


is nothing more than a weighted linear sum of the 
variables; that is, each factor is a precise statistical 
combination of the tests used in the analysis. In a
sense, a factor is produced by “adding in” carefully 
determined portions of some tests and perhaps 
“subtracting out” fractions of other tests. What 
makes the factors special is the elegant analytical 
methods used to derive them. Several different 
methods exist. These methods differ in subtle ways 
beyond the scope of this text; the reader can gather 
a sense of the differences by examining names of 
procedures: principal components factors, prin- 
cipal axis factors, method of unweighted least 


squares, maximum-likelihood method, image fac- 
toring, and alpha factoring (Tabachnick & Fidell, 
1989). Most of the methods yield highly similar
results.

The factor loadings depicted in Table 8.3 are 
nothing more than correlation coefficients between 
variables and factors. These correlations can be in- 
terpreted as showing the weight or loading of each 
factor on each variable. For example, variable 9, the 
test of Word Meaning, has a very strong loading
(.69) on factor I, modest negative loadings (-.45
and -.29) on factors II and III, and negligible load-
ings (.08 and .00) on factors IV and V.
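
The idea that loadings are correlations between variables and a derived factor can be sketched with the simplest extraction method named above, principal components. The example below is only an illustration under stated assumptions: it extracts a single component by power iteration rather than the principal-axes solution of Table 8.3, and it uses the three correlations quoted earlier for tests 7, 8, and 9 (.62, .69, and .53), where the assignment of each value to a particular pair of tests is assumed.

```python
import math


def first_component_loadings(corr, iterations=200):
    """Loadings of each variable on the first principal component.

    Power iteration finds the dominant eigenvector of the correlation
    matrix. Scaling that eigenvector by the square root of its eigenvalue
    converts the weights into variable-factor correlations (loadings).
    """
    n = len(corr)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    eigenvalue = sum(
        vec[i] * sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    loadings = [math.sqrt(eigenvalue) * v for v in vec]
    return loadings, eigenvalue


# Correlations among tests 7, 8, and 9 as quoted in the text.
# Which value belongs to which pair is an assumption for illustration.
corr = [
    [1.00, 0.62, 0.69],
    [0.62, 1.00, 0.53],
    [0.69, 0.53, 1.00],
]

loadings, eigenvalue = first_component_loadings(corr)
for test, loading in zip(("Test 7", "Test 8", "Test 9"), loadings):
    print(f"{test}: loading {loading:.2f}")
print(f"Variance accounted for: {eigenvalue:.2f} of 3")
```

The squared loadings sum to the eigenvalue, which is the amount of variance in the three tests that the single component accounts for.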
TABLE 8.3 The Principal-Axes Factor Analysis for 24 Variables

                                              Factors
                                      I     II    III    IV     V
23. Series Completion                .71   -.11   .14   .11   .07
 8. Word Classification              .70   -.24  -.15  -.11  -.13
 5. General Information              .70   -.32  -.34  -.04   .08
 9. Word Meaning                     .69   -.45  -.29   .08   .00
 6. Paragraph Comprehension          .69   -.42  -.26   .08  -.01
 7. Sentence Completion              .68   -.42  -.36  -.05  -.05
24. Arithmetic Problems              .67     ?   -.23  -.04  -.11
20. Deduction                        .64   -.19   .13   .06   .28
22. Problem Reasoning                .64     ?    .11   .05  -.04
21. Numerical Puzzles                .62    .24   .10  -.21   .16
13. Straight and Curved Capitals     .62    .28   .02  -.36  -.07
 1. Visual Perception                .62   -.01   .42  -.21  -.01
11. Code (Perceptual Speed)          .57    .44  -.20   .04   .01
18. Number-Figure                     ?     .39   .20   .15  -.11
16. Figure Recognition               .53    .08   .40   .31   .19
 4. Flags                            .51   -.18   .32  -.23  -.02
17. Object-Number                    .49     ?   -.03   .47  -.24
 2. Cubes                            .40   -.08   .39  -.23   .34
12. Count Groups of Dots             .48     ?   -.14  -.33   .11
10. Add Digits                       .47    .55  -.45  -.19   .07
 3. Paper Form Board                 .44   -.19   .48  -.12  -.36
14. Word Recognition                  ?     .09  -.03   .55   .16
15. Number Recognition               .42    .14   .10   .52   .31
19. Figure-Word                      .47    .14   .13   .20  -.61










Geometric Representation of Factor Loadings 


It is customary to represent the first two or 
three factors as reference axes in two- or three-
dimensional space.² Within this framework the
factor loadings for each variable can be plotted 
for examination. In our example, five factors were 
discovered, too many for simple visualization. 
Nonetheless, we can illustrate the value of geo- 


2. Technically, it is possible to represent all the factors as ref- 
erence axes in n-dimensional space, where n is the number of 
factors. However, when working with more than two or three 
reference axes, visual representation is no longer feasible. 


metric representation by oversimplifying some- 
what and depicting just the first two factors (Figure 
8.1). In this graph, each of the 24 tests has been 
plotted against the two factors that correspond to 
axes I and II. The reader will notice that the factor 
loadings on the first factor (I) are uniformly posi-
tive, whereas the factor loadings on the second fac-
tor (II) consist of a mixture of positive and negative.


The Rotated Factor Matrix 


An important point in this context is that the po- 
sition of the reference axes is arbitrary. There is 
FIGURE 8.1 Geometric Representation of the First Two Factors from 24 Ability Tests


nothing to prevent the researcher from rotating the 
axes so that they produce a more sensible fit with 
the factor loadings. For example, the reader will no- 
tice in Figure 8.1 that tests 6, 7, and 9 (all language 
tests) cluster together. It would certainly clarify the 
interpretation of factor I if it were to be redirected 
near the center of this cluster (Figure 8.2). This ma- 
nipulation would also bring factor II alongside in- 
terpretable tests 10, 11, and 12 (all number tests). 
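
The effect of turning the reference axes can be sketched numerically. The example below is an illustration only: it takes the unrotated loadings of two tests on factors I and II from Table 8.3 and turns both axes together until axis I passes through Word Meaning. The choice of that single test as the target and the function name `rotate_axes` are assumptions made for the sketch, not a rotation criterion used in the analysis.

```python
import math


def rotate_axes(loadings, degrees):
    """Re-express (factor I, factor II) loadings on axes turned
    counterclockwise by `degrees`; the axes stay at right angles."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return {name: (x * c + y * s, -x * s + y * c) for name, (x, y) in loadings.items()}


# Unrotated loadings on factors I and II, taken from Table 8.3.
unrotated = {
    "Word Meaning": (0.69, -0.45),
    "Add Digits": (0.47, 0.55),
}

# Angle that swings axis I onto Word Meaning (a negative angle is clockwise).
x, y = unrotated["Word Meaning"]
angle = math.degrees(math.atan2(y, x))

rotated = rotate_axes(unrotated, angle)
for name, (f1, f2) in rotated.items():
    print(f"{name:13s} I = {f1:5.2f}   II = {f2:5.2f}")
```

After the turn, Word Meaning lies entirely on axis I, while Add Digits loads about .09 on axis I and about .72 on axis II. The distance of each test from the origin is unchanged, because rotation redistributes a test's loadings across factors without altering their squared sum.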
Although rotation can be conducted manually 
by visual inspection, it is more typical for re- 
searchers to rely upon one or more objective statis- 
tical criteria to produce the final rotated factor 
matrix. Thurstone’s (1947) criteria of positive man- 
ifold and simple structure are commonly applied. 
In a rotation to positive manifold, the computer 
program seeks to eliminate as many of the negative 
factor loadings as possible. Negative factor load- 
ings make little sense in ability testing, because 





they imply that high scores on a factor are corre- 
lated with poor test performance. In a rotation to 
simple structure, the computer program seeks to 
simplify the factor loadings so that each test has 
significant loadings on as few factors as possible. 
The goal of both criteria is to produce a rotated fac- 
tor matrix that is as straightforward and unambigu- 
ous as possible. 

The rotated factor matrix for this problem is 
shown in Table 8.4. The particular method of rota- 
tion used here is called varimax rotation. Varimax 
should not be used if the theoretical expectation 
suggests that a general factor may occur. Should we 
expect a general factor in the analysis of ability 
tests? The answer is as much a matter of faith as of 
science. One researcher may conclude that a gen- 
eral factor is likely and therefore pursue a different 
type of rotation. A second researcher may be com- 
fortable with a Thurstonian viewpoint and seek 








multiple ability factors using a varimax rotation. 
We will explore this issue in more detail later, but 
it is worth pointing out here that a researcher en- 
counters many choice points in the process of con- 
ducting a factor analysis. It is not surprising, then, 
that different researchers may reach different con- 
clusions from factor analysis, even when they are 
analyzing the same data set. 
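
An objective rotation criterion can be sketched for the two-factor case, where the varimax angle has a closed form attributed to Kaiser. The sketch below is a simplification under stated assumptions: it rotates only factors I and II for six tests from Table 8.3 and uses the raw (unnormalized) criterion, whereas the analysis reported here rotated all five factors by computer program. The function names are illustrative.

```python
import math


def varimax_angle(pairs):
    """Closed-form raw varimax angle (radians) for two orthogonal factors."""
    p = len(pairs)
    u = [x * x - y * y for x, y in pairs]
    v = [2 * x * y for x, y in pairs]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2 * sum(ui * vi for ui, vi in zip(u, v))
    return math.atan2(d - 2 * a * b / p, c - (a * a - b * b) / p) / 4


def rotate(pairs, theta):
    """Loadings re-expressed on axes turned counterclockwise by theta."""
    cs, sn = math.cos(theta), math.sin(theta)
    return [(x * cs + y * sn, -x * sn + y * cs) for x, y in pairs]


def varimax_criterion(pairs):
    """Sum over factors of the variance of the squared loadings."""
    total = 0.0
    for column in zip(*pairs):
        squared = [value * value for value in column]
        mean = sum(squared) / len(squared)
        total += sum((s - mean) ** 2 for s in squared) / len(squared)
    return total


# Loadings on factors I and II for six tests, taken from Table 8.3.
names = ["Word Meaning", "Paragraph Comprehension", "General Information",
         "Add Digits", "Code (Perceptual Speed)", "Straight and Curved Capitals"]
unrotated = [(0.69, -0.45), (0.69, -0.42), (0.70, -0.32),
             (0.47, 0.55), (0.57, 0.44), (0.62, 0.28)]

theta = varimax_angle(unrotated)
rotated = rotate(unrotated, theta)
for name, (f1, f2) in zip(names, rotated):
    print(f"{name:29s} I = {f1:5.2f}   II = {f2:5.2f}")
```

In this subset the rotation concentrates the three language tests on one axis and the three number tests on the other, which is the simple structure the criterion is designed to reward.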


The Interpretation of Factors 


Table 8.4 indicates that five factors underlie the in- 
tercorrelations of the 24 ability tests. But what shall 
we call these factors? The reader may find the an- 
swer to this question disquieting, because at this 
juncture we leave the realm of cold, objective sta- 
tistics and enter the arena of judgment, insight, and 
presumption. In order to interpret or name a factor, 
the researcher must make a reasoned judgment 


FIGURE 8.2 

Geometric Representation of the First 
Two Rotated Factors from 24 Ability 
Tests 


about the common processes and abilities shared 
by the tests with strong loadings on that factor. For 
example, in Table 8.4 it appears that factor I is ver- 
bal ability, because the variables with high loadings 
stress verbal skill (e.g., Sentence Completion loads 
.86, Word Meaning loads .84, and Paragraph Com- 
prehension loads .81). The variables with low load- 
ings also help sharpen the meaning of factor I. For 
example, factor I is not related to numerical skill 
(Numerical Puzzles loads .18) or spatial skill 
(Paper Form Board loads .16). Using a similar form 
of inference, it appears that factor II is mainly nu- 
merical ability (Add Digits loads .85, Count 
Groups of Dots loads .80). Factor III is less certain 
but appears to be a visual-perceptual capacity, and 
factor IV appears to be a measure of recognition. 
We would need to analyze the single test on factor 
V (Figure-Word) to surmise the meaning of this 
factor. 
TABLE 8.4 The Rotated Varimax Factor Matrix for 24 Ability Variables

                                              Factors
                                      I     II    III    IV     V
 7. Sentence Completion              .86    .15    ?    .03   .07
 9. Word Meaning                     .84    .06   .15   .18   .08
 6. Paragraph Comprehension          .81    .07   .16   .18   .10
 5. General Information              .79    .22   .16   .12  -.02
 8. Word Classification              .65    .22   .28   .03   .21
22. Problem Reasoning                .43    .12   .38   .23   .22
10. Add Digits                       .18    .85  -.10   .09  -.01
12. Count Groups of Dots             .02    .80   .20   .03   .00
11. Code (Perceptual Speed)          .18    .64   .05   .30   .17
13. Straight and Curved Capitals     .19    .60   .40  -.05   .18
24. Arithmetic Problems              .41    .54   .12   .16   .24
21. Numerical Puzzles                .18    .52   .45   .16   .02
18. Number-Figure                    .00    .40   .28   .38   .36
 1. Visual Perception                 ?     .21   .69   .10   .20
 2. Cubes                            .09    .09   .65   .12  -.18
 4. Flags                            .26    .07   .60  -.01   .15
 3. Paper Form Board                 .16   -.09   .57  -.05   .49
23. Series Completion                .42    .24   .52   .18   .11
20. Deduction                        .43    .11   .47   .35  -.07
15. Number Recognition               .11    .09   .12   .74  -.02
14. Word Recognition                 .23    .10   .00   .69   .10
16. Figure Recognition               .07    .07   .46   .59   .14
17. Object-Number                    .15    .25  -.06   .52   .49
19. Figure-Word                      .16    .16   .11   .14   .77





Note: Boldfaced entries signify subtests loading strongly on each factor. 


These results illustrate a major use of factor 
analysis, namely, the identification of a small num- 
ber of marker tests from a large test battery. Rather 
than using a cumbersome battery of 24 tests, a re- 
searcher could gain nearly the same information by 
carefully selecting several tests with strong load- 
ings on the five factors. For example, the first factor 
is well represented by test 7, Sentence Completion 
(.86) and test 9, Word Meaning (.84); the second 
factor is reflected in test 10, Add Digits (.85), while 
the third factor is best illustrated by test 1, Visual 
Perception (.69). The fourth factor is captured 
by test 15, Number Recognition (.74) and Word 


Recognition (.69). Of course, the last factor loads 
well on only test 19, Figure-Word (.77). 
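
The selection of marker tests can be sketched as a simple filter on the rotated loadings. The rows below are copied from Table 8.4 for six of the 24 tests; the cutoff of .60 for a "strong" loading and the function name `marker_tests` are assumptions made for illustration, since the text does not specify a numerical threshold.

```python
FACTORS = ("I", "II", "III", "IV", "V")

# Rotated loadings for six of the 24 tests, copied from Table 8.4.
rotated = {
    "Paragraph Comprehension": (0.81, 0.07, 0.16, 0.18, 0.10),
    "Add Digits": (0.18, 0.85, -0.10, 0.09, -0.01),
    "Cubes": (0.09, 0.09, 0.65, 0.12, -0.18),
    "Paper Form Board": (0.16, -0.09, 0.57, -0.05, 0.49),
    "Number Recognition": (0.11, 0.09, 0.12, 0.74, -0.02),
    "Word Recognition": (0.23, 0.10, 0.00, 0.69, 0.10),
}


def marker_tests(loadings, cutoff=0.60):
    """Group tests under each factor on which they load at or above cutoff,
    strongest loading first."""
    markers = {factor: [] for factor in FACTORS}
    for test, row in loadings.items():
        for factor, value in zip(FACTORS, row):
            if value >= cutoff:
                markers[factor].append((test, value))
    for factor in markers:
        markers[factor].sort(key=lambda item: item[1], reverse=True)
    return markers


for factor, tests in marker_tests(rotated).items():
    listing = ", ".join(f"{name} ({value:.2f})" for name, value in tests) or "none"
    print(f"Factor {factor}: {listing}")
```

With this subset and cutoff, Paper Form Board is not selected for any factor because its loadings are split between factors III and V, which is the kind of ambiguity a researcher would want to avoid in a marker test.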


Issues in Factor Analysis 


Unfortunately, factor analysis is frequently misun- 
derstood and often misused. Some researchers ap- 
pear to use factor analysis as a kind of divining rod, 
hoping to find gold hidden underneath tons of dirt. 
But there is nothing magical about the technique. 
No amount of statistical analysis can rescue data 
based on trivial, irrelevant, or haphazard measures. 
If there is no gold to be found, then none will be 
found; factor analysis is not alchemy. Factor analy-
sis will yield meaningful results only when the re-
search was meaningful to begin with. 

An important point is that a particular kind of 
factor can emerge from factor analysis only if the 
tests and measures contain that factor in the first 
place. For example, a short-term memory factor can- 
not possibly emerge from a battery of ability tests if 
none of the tests requires short-term memory. In 
general, the quality of the output depends upon the 
quality of the input. We can restate this point as the 
acronym GIGO, or “garbage in, garbage out.” 

Sample size is crucial to a stable factor analysis. 
Comrey (1973) offers the following rough guide: 


Sample Size     Rating
     50         Very poor
    100         Poor
    200         Fair
    300         Good
    500         Very good
  1,000         Excellent


In general, it is comforting to have at least five sub-
jects for each test or variable (Tabachnick & Fidell,
1989).
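
Both rules of thumb are easy to apply mechanically, as the following sketch shows. Treating each benchmark in Comrey's guide as a floor (so that a sample between two benchmarks receives the lower rating) is an assumption of this sketch; the guide itself lists only the benchmark values. The function names are illustrative.

```python
# Comrey's (1973) benchmarks, largest first.
COMREY_GUIDE = [
    (1000, "Excellent"),
    (500, "Very good"),
    (300, "Good"),
    (200, "Fair"),
    (100, "Poor"),
    (50, "Very poor"),
]


def comrey_rating(n_subjects):
    """Rating of the largest benchmark that the sample size reaches."""
    for benchmark, label in COMREY_GUIDE:
        if n_subjects >= benchmark:
            return label
    return "Below the smallest benchmark"


def subjects_per_variable(n_subjects, n_variables):
    """Ratio used in the five-subjects-per-variable rule of thumb."""
    return n_subjects / n_variables


# The example in this chapter: 145 students and 24 tests.
ratio = subjects_per_variable(145, 24)
print(f"Subjects per variable: {ratio:.1f}")
print(f"Meets the five-per-variable rule: {ratio >= 5}")
print(f"Rating under the floor interpretation: {comrey_rating(145)}")
```

The two guides need not agree: a sample can satisfy the subjects-per-variable rule and still fall low on the absolute scale, which is one reason both are described as rough.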

Finally, we cannot overemphasize the extent to 
which factor analysis is guided by subjective choices 
and theoretical prejudices. A crucial question in this 
regard is the choice between orthogonal axes and 
oblique axes. With orthogonal axes, the factors are 
at right angles to one another, which means that they 
are uncorrelated (Figures 8.1 and 8.2 both depict or-
thogonal axes). In many cases the clusters of factor
loadings are situated such that oblique axes provide 
a better fit. With oblique axes, the factors are cor- 
related among themselves. Some researchers con- 
tend that oblique axes should always be used, 
whereas others take a more experimental approach. 
Tabachnick and Fidell (1989) recommend an ex- 
ploratory strategy based on repeated factor analyses. 
Their approach is unabashedly opportunistic: 


During the next few runs, researchers experiment 
with different numbers of factors, different extrac- 
tion techniques, and both orthogonal and oblique 
rotations. Some number of factors with some 
combination of extraction and rotation produces 


the solution with the greatest scientific utility, 
consistency, and meaning; this is the solution that 
is interpreted. 


With oblique rotations it is also possible to fac-
tor analyze the factors themselves. Such a proce-
dure may yield one or more second-order factors.
Second-order factors can provide support for the 
hierarchical organization of traits and may offer a 
rapprochement between ability theorists who posit 
a single general factor (e.g., Spearman) and those 
who promote several group factors (e.g., Thur- 
stone). Perhaps both camps are correct, with the 
group factors sitting underneath the second-order 
general factor. 


MULTIPLE APTITUDE TEST BATTERIES


As previously noted, aptitude tests did not flourish 
until the prerequisite statistical tools—factor-ana- 
lytic procedures—were available. One of the major 
applications of factor analysis was the development 
of multiple aptitude test batteries. In a multiple ap- 
titude test battery, the examinee is tested in several 
separate, homogeneous aptitude areas. The devel- 
opment of the subtests is dictated by the findings of 
factor analysis. For example, Thurstone developed 
one of the first multiple aptitude test batteries, the 
Primary Mental Abilities Test, a set of seven tests 
chosen on the basis of factor analysis (Thurstone, 
1938). 

More recently, several multiple aptitude test 
batteries have gained favor for educational and ca- 
reer counseling, vocational placement, and armed 
services classification (Gregory, 1994a). Each year 
hundreds of thousands of persons are administered 
one of these prominent batteries: The Differential 
Aptitude Test (DAT), the General Aptitude Test 
Battery (GATB), and the Armed Services Voca- 
tional Aptitude Battery (ASVAB). These batteries 
either used factor analysis directly for the delin- 
eation of useful subtests or were guided in their 
construction by the accumulated results of other 
factor-analytic research. The salient characteristics 
of each battery are briefly reviewed in the follow- 
ing sections. 
The Differential Aptitude Test (DAT)


The DAT was first issued in 1947 to provide a basis 
for the educational and vocational guidance of stu- 
dents in grades seven through twelve. Subse- 
quently, examiners have found the test useful in the 
vocational counseling of young adults out of school 
and in the selection of employees. Now in its fifth 
edition (1992), the test has been periodically re- 
vised and stands as one of the most popular multi- 
ple aptitude test batteries of all time (Bennett, 
Seashore, & Wesman, 1982, 1984). 
The DAT consists of eight independent tests: 


1. Verbal Reasoning (VR)
2. Numerical Reasoning (NR)
3. Abstract Reasoning (AR)
4. Perceptual Speed and Accuracy (PSA)
5. Mechanical Reasoning (MR)
6. Space Relations (SR)
7. Spelling (S)
8. Language Usage (LU)


A characteristic item from each test is shown in 
Figure 8.3. 

The authors chose the areas for the eight tests 
based on experimental and experiential data rather 
than relying upon a formal factor analysis of their 
own. In constructing the DAT, the authors were 
guided by several explicit criteria: 


Each test should be an independent test: There 
are situations in which only part of the battery is 
required or desired. 

The tests should measure power: For most voca- 
tional purposes to which test results contribute, 
the evaluation of power—solving difficult prob- 
lems with adequate time—is of primary concern. 
The test battery should yield a profile: The eight 
separate scores can be converted to percentile 
ranks and plotted on a common profile chart. 
The norms should be adequate: In the fifth edi- 
tion, the norms are derived from 100,000 stu- 
dents for the fall standardization, 70,000 for the 
spring standardization. 

The test materials should be practical: With time 
limits of 6 to 30 minutes per test, the entire DAT 
can be administered in a morning or an afternoon 
school session. 

The tests should be easy to administer: Each test 
contains excellent “warm up” examples and can 
be administered by persons with a minimum of 
special training. 

Alternate forms should be available: For pur- 
poses of retesting, the availability of alternate 
forms (currently forms C and D) will reduce any 
practice effects. 


The reliability of the DAT is generally quite 
high, with split-half coefficients largely in the .90s 





VERBAL REASONING
Choose the correct pair of words to fill in the blanks.

___ is to eye as eardrum is to ___

A. vision - sound
B. iris - hear
C. retina - ear
D. sight - cochlea
E. eyelash - earlobe

NUMERICAL ABILITY
Choose the correct answer.

4(-5)(-3) =

A. -60   B. 27   C. -27   D. 60   E. none of these

FIGURE 8.3 Differential Aptitude Tests and Characteristic Items
ABSTRACT REASONING
The four figures in the row to the left make a series. Find the single choice on the right
that would be next in the series.

[Series figures and answer choices A through D]

CLERICAL SPEED AND ACCURACY
In each test item, one of the combinations is underlined. Mark the same combination on the
answer sheet.

1. AB  Ab  AA  BA  Bb        2. 5m  5M  M5  Mm  m5

   Ab  Bb  AA  BA  AB           M5  m5  Mm  5m  5M

MECHANICAL REASONING
Which lever will require more force to lift an object of the same weight? If equal, mark C.

[Lever diagrams A and B; C (equal)]

SPACE RELATIONS
Which of the figures on the right can be made by folding the pattern at the left? The pattern
always displays the outside of the figure.

[Folding pattern and answer figures]

SPELLING
Mark whether each word is spelled right or wrong.

1. irelevant       R   W
2. parsimonious    R   W
3. excellant       R   W

LANGUAGE USAGE
Decide which part of the sentence contains an error and mark the corresponding letter on
the answer sheet. Mark N (None) if there is no error.

In spite of public criticism, / the researcher studied / the affects of radiation / on plant growth.
              A                           B                         C                      D

FIGURE 8.3 continued
and alternate-forms reliabilities ranging from .73 to
.90, with a median of .83. Mechanical Reasoning is 
an exception, with reliabilities as low as .70 for girls. 
The tests show a mixed pattern of intercorrelations 
with each other, which is optimistically interpreted 
by the authors as establishing the independence of 
the eight tests. Actually, many of the correlations are 
quite high and it seems likely that the eight tests re- 
flect a smaller number of ability factors. Certainly, 
the Verbal Reasoning and Numerical Reasoning 
tests measure a healthy general factor, with correla- 
tions around .70 in various samples. 

The manual presents extensive data demonstrat- 
ing that the DAT tests, especially the VR + NR com- 
bination, are good predictors of other criteria such as 
school grades and scores on other aptitude tests (cor- 
relations in the .60s and .70s). For this reason, the 
combination of VR + NR often is considered an 
index of scholastic aptitude. Evidence for the differ- 
ential validity of the other tests is rather slim. Ben- 
nett, Seashore, and Wesman (1974) do present 
results of several follow-up studies correlating vo- 
cational entry/success with DAT profiles, but their 
research methods are more impressionistic than 
quantitative; the independent observer will find it 
difficult to make use of their results. Schmitt (1995) 
notes that a major problem with the battery is the 


lack of discriminant validity between the eight sub- 
tests. With the exception of the Perceptual Speed 
and Accuracy test, all of the subscales are highly in- 
tercorrelated (.50 to .75). If one wants only a gen- 
eral index of the person’s academic ability, this is 
fine; if the scores on the subtests are to be used in 
some diagnostic sense, this level of intercorrelation 
makes statements about students’ relative strengths 
and weaknesses highly questionable. 


Even so, the revised DAT is better than previous 
editions. One significant improvement is the elim- 
ination of apparent sex bias on the Language Usage 
and Mechanical Reasoning tests—a source of crit- 
icism from earlier reviews. The DAT has been 
translated into several languages and is widely used 
in Europe for vocational guidance and research ap- 
plications (e.g., Nijenhuis, Evers, & Mur, 2000; 
Colom, Quiroga, & Juan-Espinosa, 1999). 


The General Aptitude Test Battery (GATB) 


In the late 1930s, the U.S. Department of Labor de- 
veloped aptitude tests to predict job performance in 
100 specific occupations. In the 1940s, the depart- 
ment hired a panel of experts in measurement and 
industrial-organizational psychology to create a 
multiple aptitude test battery to assess the 100 oc- 
cupations previously studied and many more. The 
outcome of this Herculean effort was the General 
Aptitude Test Battery (GATB), widely acknowl-
edged as the premier test battery for predicting job
performance (Hunter, 1994). 

The GATB was derived from a factor analysis 
of 59 tests administered to thousands of male 
trainees in vocational courses (United States Em- 
ployment Service, 1970). The interpretive stan- 
dards have been periodically revised and updated, 
so the GATB is a thoroughly modern instrument 
even though its content is little changed. One limi- 
tation is that the battery is available mainly to state 
employment offices, although nonprofit organiza- 
tions, including high schools and certain colleges, 
can make special arrangements for its use. 

The GATB is composed of eight paper-and- 
pencil tests and four apparatus measures. The en- 
tire battery can be administered in approximately 
two and a half hours and is appropriate for high 
school seniors and adults. The twelve tests yield a 
total of nine factor scores: 


• General Learning Ability (intelligence) (G). This score is a
  composite of Vocabulary, Arithmetic Reasoning, and
  Three-Dimensional Space.

• Verbal Aptitude (V). Derived from a Vocabulary test that requires
  the examinee to indicate which two words in a set are either
  synonyms or antonyms.

• Numerical Aptitude (N). This score is a composite of both the
  Computation and Arithmetic Reasoning tests.

• Spatial Aptitude (S). Consists of the Three-Dimensional Space
  test, a measure of the ability to perceive two-dimensional
  representations of three-dimensional objects and to visualize
  movement in three dimensions.

• Form Perception (P). This score is a composite of Form Matching
  and Tool Matching, two tests in which the examinee must match
  identical drawings.

• Clerical Perception (Q). A proofreading test called Name
  Comparison, in which the examinee must match names under
  pressure of time.

• Motor Coordination (K). Measures the ability to quickly make
  specified pencil marks in the Mark Making test.

• Finger Dexterity (F). A composite of the Assemble and Disassemble
  tests, two measures of dexterity with rivets and washers.

• Manual Dexterity (M). A composite of Place and Turn, two tests
  requiring the examinee to transfer and reverse pegs in a board.


The nine factor scores on the GATB are ex- 
pressed as standard scores with a mean of 100 and 
an SD of 20. These standard scores are anchored to 
the original normative sample of 4,000 workers ob- 
tained in the 1940s. Alternate-forms reliability co- 
efficients for factor scores range from the .80s to 
the .90s. The GATB manual summarizes several 
studies of the validity of the test, primarily in terms 
of its correlation with relevant criterion measures. 
Hunter (1994) notes that GATB scores predict 
training success for all levels of job complexity. The 
average validity coefficient is a phenomenal .62. 
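
The GATB standard-score scale described above is a linear transformation of z scores, which can be sketched as follows. The constants come from the text (mean of 100, SD of 20); the function names are illustrative.

```python
GATB_MEAN = 100
GATB_SD = 20


def to_standard_score(z):
    """Convert a z score to the GATB standard-score scale."""
    return GATB_MEAN + GATB_SD * z


def to_z(standard_score):
    """Convert a GATB standard score back to a z score."""
    return (standard_score - GATB_MEAN) / GATB_SD


# An examinee 1.5 SDs above the normative mean on a factor score.
print(to_standard_score(1.5))   # 130.0
# A factor score of 80 lies one SD below the normative mean.
print(to_z(80))                 # -1.0
```

Because the scale is anchored to the original normative sample, a score of 100 represents the average of that sample rather than the average of present-day examinees.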


The absolute scores are of less interest than 
their comparison to updated Occupational Aptitude 
Patterns (OAPs) for dozens of occupations. Based 
on test results for huge samples of applicants and 
employees in different occupations, counselors and 
employers now have access to a wealth of infor-
mation about score patterns needed for success in a
variety of jobs. Thus, one way of using the GATB 
is to compare an examinee’s scores with OAPs 
believed necessary for proficiency in various 
occupations. 

Hunter (1994) recommends an alternative strat- 
egy based on composite aptitudes (Figure 8.4). The 
nine specific factor scores combine nicely into 
three general factors: Cognitive, Perceptual, and 
Psychomotor. Hunter notes that different jobs re- 
quire various contributions of the Cognitive, Per- 
ceptual, and Psychomotor aptitudes. For example, 
an assembly line worker in an automotive plant 
might need high scores on the Psychomotor and 
Perceptual composites, whereas the Cognitive 
score would be less important for this occupation. 
Hunter’s research demonstrates that general factors 
dominate over specific factors in the prediction of 
job performance. Davison, Gasser, and Ding (1996) 
discuss additional approaches to GATB profile 
analysis and interpretation. 

Van de Vijver and Harsveld (1994) investigated 
the equivalence of their computerized version of the 
GATB with the traditional paper-and-pencil version.
Of course, only the cognitive and perceptual sub- 
tests were compared—tests of motor skills cannot 
be computerized. They found that the two versions 
were not equivalent. In particular, the computerized 
subtests produced faster and more inaccurate re- 
sponses than the conventional subtests. Their re- 
search demonstrates once again that the equivalence 
of traditional and computerized versions of a test 
should not be assumed. This is an empirical question 
answerable only with careful research. Nijenhuis 
and van der Flier (1997) discuss a Dutch version of 
the GATB and its application in the study of cogni- 
tive differences between immigrants and majority 
group members in the Netherlands. 


The Armed Services Vocational 
Aptitude Battery (ASVAB) 


The ASVAB is probably the most widely used ap- 
titude test in existence. This instrument is used by 
the Armed Services to screen potential recruits and 
to assign personnel to different jobs and training 
programs. The ASVAB is also available in a computerized
version that is rapidly supplanting the
original paper-and-pencil test (Segall & Moreno,
1999). The computerized ASVAB is discussed in
more detail at the end of this section. More than 2
million examinees take the ASVAB each year. The
current version consists of ten subtests, four of
which produce the Armed Forces Qualification
Test (AFQT), the common qualifying exam for all 
services (Table 8.5). Eight subtests are power tests 
with adequate time limits for most subjects, 
whereas two subtests (Numerical Operations and 
Coding Speed) are speeded tests that place a pre- 
mium upon rapid performance. Alternate-forms 
reliability coefficients for ASVAB scores are in 
the mid-.80s to mid-.90s, and test-retest coeffi- 
cients range from the mid-.70s to the mid-.80s 
(Larson, 1994). The one exception is Paragraph 
Comprehension with a reliability of only .50. The 
test is well normed on a representative sample of 
12,000 persons between the ages of 16 and 23 
years. The ASVAB manual reports a median 
validity coefficient of .60 with measures of train- 
ing performance. 

Decisions about ASVAB examinees are typi- 
cally based upon composite scores, not subtest 
scores. For example, a Clerical Composite is de- 
rived by combining Word Knowledge, Paragraph 
Comprehension, Numerical Operations, and Cod- 
ing Speed. Subjects scoring well on this composite 
might be assigned to secretarial positions. Since the 
composite scores are empirically derived, new ones 
can be developed for placement decisions at any 
time. Composite scores are continually updated and 
revised. For example, Ree and Carretta (1994) ad- 
vocated three composites derived from a factor 
analysis of more than 11,000 participants in the 


TABLE 8.5 The Armed Services Vocational Aptitude Battery (ASVAB) Subtests

Subtest                      Description
Arithmetic Reasoning*        30-item test of arithmetic word problems based upon simple calculation
Mathematics Knowledge*       25-item test of algebra, geometry, fractions, decimals, and exponents
Paragraph Comprehension*     15-item test of reading comprehension in short paragraphs
Word Knowledge*              35-item test of vocabulary knowledge and synonyms
Coding Speed                 84-item speeded test of substitution of numeric for verbal codes
General Science              25-item test of general knowledge in physical and biological science
Numerical Operations         50-item speeded test of ability to add, subtract, multiply, and divide
Electronics Information      20-item test of electronics, radio, and electrical principles
Mechanical Comprehension     25-item test of mechanical and physical principles
Auto and Shop Information    25-item test of basic knowledge of autos, shop practices, and tool usage

*Armed Forces Qualification Test (AFQT).

ASVAB testing. These composites and their constituent
tests were as follows:


1. Speed: Numerical Operations and Coding Speed 

2. Verbal/Math: Arithmetic Reasoning, Word 
Knowledge, Paragraph Comprehension, and 
Mathematics Knowledge 

3. Technical Knowledge: General Science, Auto 
and Shop Information, Mechanical Comprehen- 
sion, and Electronics Information 


The reader will notice that the second factor is iden- 
tical to the AFQT, mentioned previously. 

At one point, the Armed Services relied heavily 
upon the seven composites in the following list 
(Murphy, 1984). The first three constitute academic 
composites, whereas the remaining are occupational 
composites. The reader will notice that individual 
subtests may appear in more than one composite: 


1. Academic Ability: Word Knowledge, Paragraph 
Comprehension, and Arithmetic Reasoning 

2. Verbal: Word Knowledge, Paragraph Compre- 
hension, and General Science 

3. Math: Mathematics Knowledge and Arithmetic 
Reasoning

4. Mechanical and Crafts: Arithmetic Reasoning, 
Mechanical Comprehension, Auto and Shop In- 
formation, and Electronics Information 

5. Business and Clerical: Word Knowledge, Para- 
graph Comprehension, Mathematics Knowl- 
edge, and Coding Speed 

6. Electronics and Electrical: Arithmetic Reason- 
ing, Mathematics Knowledge, Electronics In- 
formation, and General Science 

7. Health, Social, and Technology: Word Knowl- 
edge, Paragraph Comprehension, Arithmetic 
Reasoning, and Mechanical Comprehension 


The problem with forming composites in this 
manner is that they are so highly correlated with one 
another as to be essentially redundant. In fact, the av- 
erage intercorrelation among these seven composite 
scores is .86 (Murphy, 1984)! Clearly, composites
do not always provide differential information about 
specific aptitudes. Perhaps that is why recent editions
of the ASVAB have steered clear of multiple,
complex composites. Instead, the emphasis is on
simpler composites that are composed of highly 
related constructs. For example, a Verbal Ability 
composite is derived from Word Knowledge and 
Paragraph Comprehension, two highly interrelated 
subtests. In like manner, a Math Ability composite is 
obtained from the combination of Arithmetic Rea- 
soning and Mathematics Knowledge. 
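
One contributor to the redundancy among the seven older composites is easy to tally: most of them share subtests. The Python sketch below counts the overlapping pairs. The two-letter subtest abbreviations are shorthand introduced here for illustration, and shared subtests are only one possible reason for the high intercorrelations (a common general factor is another).

```python
from itertools import combinations

# The seven composites listed by Murphy (1984), as sets of subtests.
# Abbreviations are illustrative shorthand, e.g. WK = Word Knowledge.
COMPOSITES = {
    "Academic Ability": {"WK", "PC", "AR"},
    "Verbal": {"WK", "PC", "GS"},
    "Math": {"MK", "AR"},
    "Mechanical and Crafts": {"AR", "MC", "AS", "EI"},
    "Business and Clerical": {"WK", "PC", "MK", "CS"},
    "Electronics and Electrical": {"AR", "MK", "EI", "GS"},
    "Health, Social, and Technology": {"WK", "PC", "AR", "MC"},
}

def shared_subtests(a, b):
    """Return the subtests that two named composites have in common."""
    return COMPOSITES[a] & COMPOSITES[b]

pairs = list(combinations(COMPOSITES, 2))
overlapping = [(a, b) for a, b in pairs if shared_subtests(a, b)]
print(f"{len(overlapping)} of {len(pairs)} composite pairs share at least one subtest")
```

By contrast, the simpler Verbal Ability and Math Ability composites described above have no subtests in common.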

Some researchers have concluded that the 
ASVAB does not function as a multiple aptitude test 
battery, but achieves success in predicting diverse 
vocational assignments because the composites in- 
variably tap a general factor of intelligence. For ex- 
ample, Dunai and Porter (2001) report favorably on 
the ASVAB as a predictor of entry-level success of 
radiography students in Air Force medical training. 
The ASVAB may be a good test of general intelli- 
gence, but it falls short as a multiple aptitude test 
battery. Another concern is that the test may possess 
different psychometric structures for men and 
women. Specifically, the Electronics Information 
subtest is a good measure of g (the general factor of 
intelligence) for men, but not women (Ree & Car- 
retta, 1995). The likely explanation for this is that 
men are about nine times more likely to enroll in 
high school classes in electronics and auto shop, and 
men therefore have the opportunity for their general 
ability to shape what they learn about electronics in- 
formation, whereas women do not. Scores on this 
subtest will therefore function as a measure of 
achievement (what has already been learned) but not 
as an index of aptitude (forecasting future results). 

Research on a computerized adaptive testing 
(CAT) version of the ASVAB has been underway 
since the 1980s. Computerized adaptive testing 
is discussed in Topic 15A, Computerized Assess- 
ment and the Future of Testing. We provide a brief 
overview here. In CAT, the examinee takes the test 
while sitting at a computer terminal. The difficulty 
level of the items presented on the screen is con- 
tinually readjusted as a function of the examinee’s 
ongoing performance. In general, an examinee who 
answers a subtest item correctly will receive a 
harder item, whereas an examinee who fails that 
item will receive an easier item. The computer 
uses item response theory as a basis for selecting 
items. Each examinee receives a unique set of test
items tailored to his or her ability level. 
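
The harder-after-correct, easier-after-error logic can be sketched in a few lines of Python. This is a simplified up-and-down rule, not the item response theory procedure an operational CAT would use. The item bank, the step sizes, the simulated examinee, and the function name run_adaptive_test are assumptions made for illustration.

```python
def run_adaptive_test(item_difficulties, answers_correctly, n_items=5):
    """Administer n_items adaptively; return (administered difficulties, final estimate)."""
    remaining = sorted(item_difficulties)
    estimate, step = 0.0, 1.0
    administered = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item whose difficulty is closest to the current estimate.
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        administered.append(item)
        # Raise the estimate after a correct answer, lower it after an error.
        estimate += step if answers_correctly(item) else -step
        step /= 2  # make smaller adjustments as evidence accumulates
    return administered, estimate

# Hypothetical bank: difficulties from -2.0 to 2.0 in steps of 0.25.
bank = [i / 4 for i in range(-8, 9)]
# Hypothetical examinee who passes every item easier than 0.8.
administered, estimate = run_adaptive_test(bank, lambda d: d < 0.8)
print(administered, estimate)
```

Five items are enough here to bring the estimate close to the simulated examinee's threshold, which is the intuition behind the shorter testing times claimed for adaptive tests.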

In 1990, the CAT-ASVAB began to replace the 
paper-and-pencil ASVAB. Currently, more than 
two-thirds of all military applicants are tested with 
the computerized version. Larson (1994) lists the 
reasons for adopting the CAT-ASVAB as follows: 


1. Shorten overall testing time (adaptive tests re- 
quire roughly one-half the items of standard 
tests). 

2. Increase test security by eliminating the possi- 
bility that test booklets could be stolen. 

3. Increase test precision at the upper and lower 
ability extremes. 

4. Provide a means for immediate feedback on 
test scores, since the computers used for testing 
can immediately score the tests and output the 
results. 

5. Provide a means for flexible test start times (un- 
like group-administered paper-and-pencil tests, 
for which everyone must start and stop at the 
same time, computer-based testing can be tai- 
lored to the examinees’ personal schedules). 


Reliability and validity studies of the CAT- 
ASVAB provide strong support for its equivalence 
to the original test. In general, the computerized 
version of the instrument measures the same con- 
structs as its paper-and-pencil counterpart—and 
does so in less time and with greater precision 
(Moreno & Segall, 1997; Segall, 1997). With the 
success of this project, the CAT-ASVAB and other 
tests likely will be expanded to measure new as- 
pects of performance such as response latencies 
and to display unique item types such as visuospatial
tests of objects in motion (Larson, 1994).
The CAT-ASVAB has the potential to change the 
future of testing. 


PREDICTING COLLEGE 
PERFORMANCE 


As almost every college student knows, a major use
of aptitude tests is the prediction of academic performance.
In most cases, applicants to college must
contend with the Scholastic Assessment Tests
(SAT) or the American College Test (ACT) assess- 
ment program. Institutions may set minimum stan- 
dards on the SAT or ACT tests for admission, based 
on the knowledge that low scores foretell college 
failure. In this section we will explore the technical 
adequacy and predictive validity of the major col- 
lege aptitude tests. 


The Scholastic Assessment Tests (SAT) 


Formerly known as the Scholastic Aptitude Tests, 
the Scholastic Assessment Tests, or SAT, is the old- 
est of the college admissions tests, dating back to 
1926. The SAT is published by the College Board 
(formerly the College Entrance Examination 
Board), a group formed in 1899 to provide a na- 
tional clearinghouse for admissions testing. As 
noted by historian Fuess (1950), the purpose of a 
nationally based admissions test was “to introduce 
law and order into an educational anarchy which 
towards the close of the nineteenth century had be- 
come exasperating, indeed almost intolerable, to 
schoolmasters.” Over the years, the test has been 
extensively revised, continuously updated, and re- 
peatedly renormed. In the early 1990s, the SAT was 
renamed the Scholastic Assessment Tests to em- 
phasize changes in content and format. The new 
SAT assesses mastery of high school subject matter
to a greater extent than its predecessor, but continues
to tap reasoning skills. The SAT represents
the state of the art for aptitude testing.

The new SAT consists of the SAT-I Reasoning 
Tests and the SAT-II Subject Tests. The SAT-I 
Verbal Reasoning Test emphasizes vocabulary in 
context, reading comprehension, and critical rea- 
soning. The SAT-I Math Reasoning emphasizes the 
application of mathematical concepts, the inter- 
pretation of data, and the actual construction of a 
response, as opposed to the typical multiple-choice 
format. A calculator is highly recommended but 
not required. 

The SAT-I Verbal Reasoning and Math Rea- 
soning scores are reported on a scale that ranges 
from 200 to 800. Characteristic item types for the 
Verbal portion include the following: 

• Analogies: Select a pair of words that best expresses
  a relationship similar to that expressed in
  a stimulus pair.

• Sentence Completions: For a sentence with one
  or two blanks, choose a word or pair of words
  that best fits the meaning of the sentence as a
  whole.

• Reading Comprehension: Read a passage and
  answer multiple-choice questions based on what
  is stated or implied in the passage.


Characteristic item types for the Math portion include
the following:


• Regular Mathematics: Solve basic problems in
  geometry and algebra.

• Quantitative Comparisons: Determine which of two
  quantities is greater, or indicate that they are
  equal, or indicate that the problem is unsolvable
  from the information given.


A persistent misconception about SAT scores is 
that 500 and 100 represent the mean and standard 
deviation of the most recent sample of SAT test tak- 
ers (Donlon, 1984). In fact, the numbers 500 and 
100 refer to the mean and standard deviation of the 
anchor group of 10,654 students who took the Verbal
portion of the SAT in April 1941. (The Mathematics
portion was equated to this verbal portion
the next year.) All new scores are equated to the anchor
scores by linking each new form of the SAT to
one or more previous forms. For example, if a new 
form is slightly easier than previous forms, the test 
taker may need a few more correct answers in order 
to attain an equivalent score. This procedure guar- 
antees that current SAT scores are based on the 
same measurement scale used at the inception of 
the anchoring procedure in 1941.³ A rescaling and
repositioning of SAT scores was scheduled for
1996. One purpose of the rescaling was to provide
more reliable measurement in upper and lower
score ranges by widening the range of item difficulty
(Johnson, 1994).

3. This practice differs from the procedure followed in the
renorming of IQ tests. When a revision of an IQ test is newly
standardized, the mean for the population is always set at 100,
regardless of the comparative difficulty of the revised test and
its predecessor.
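
The equating step described above, in which a slightly easier form demands a few more correct answers for the same score, can be illustrated with linear equating, one classical method. The text does not say which equating method the SAT program uses, so the method, the form statistics, and the function name linear_equate in this Python sketch are assumptions.

```python
def linear_equate(raw_new, mean_new, sd_new, mean_old, sd_old):
    """Map a raw score on the new form onto the old form's raw-score scale."""
    z = (raw_new - mean_new) / sd_new   # standing relative to the new form
    return mean_old + sd_old * z        # same standing expressed on the old form

# Hypothetical statistics: the new form is slightly easier (higher mean raw score).
equated = linear_equate(raw_new=42, mean_new=42.0, sd_new=10.0,
                        mean_old=40.0, sd_old=10.0)
print(equated)
```

With these invented values, 42 correct answers on the easier new form carry the same standing as 40 correct answers on the older form.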

From year to year, the average score for SAT 
test takers may be substantially different from the 
original average of 500. In fact, SAT scores de- 
clined precipitously from 1963 to 1980. By 1980, 
the Mathematics average had declined from 500 to 
about 465, while the Verbal average reached a low 
of 420, nearly a full standard deviation below its 
starting point. Average scores on both scales have 
increased only slightly since then. This phenome- 
non has been the subject of intense scrutiny, and it 
is beyond the scope of this text to review all the explanations
that have been proffered. The following
findings are generally accepted: 


The decline was not an artifact of SAT difficulty 
or scaling; it was a real phenomenon that affected 
other major testing programs. 

The decline was significant, representing a size- 
able shift in test performance; the change repre- 
sented a “serious deterioration of the learning 
process in America” (Wirtz, 1977). 

The decline did not lessen the predictive validity 
of the SAT; the test continued to correlate well 
with college performance. 

Population shifts such as increases in family size 
may explain part of the decline; if average fam- 
ily size continues to decrease, SAT scores are 
predicted to increase (Zajonc, Markus, & 
Markus, 1979). 

Social changes such as the expansion of televi- 
sion may have contributed to the decline; how- 
ever, such hypotheses are difficult to prove 
(Donlon, 1984). 


Great care is taken in the construction of new 
forms of the SAT Verbal and Math tests because un- 
failing reliability and a high degree of parallelism 
are essential to the mission of this testing program. 
The internal consistency reliability of both forms is 
repeatedly in the range of .91 to .93; with only a 
few exceptions, test-retest correlations vary be- 
tween .87 and .89. The standard error of measure- 
ment is 30 to 35 points. 
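
The reported standard error of measurement is consistent with the classical formula SEM = SD × sqrt(1 − r). The Python sketch below assumes a scale standard deviation of 100, the nominal value for the anchor group. The actual standard deviation among current test takers may differ, so the figures are approximate.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Reliability values within the ranges reported in the text; SD of 100 is assumed.
for r in (0.93, 0.91, 0.88):
    print(r, round(standard_error_of_measurement(100, r), 1))
```

A reliability of .91 yields an SEM of about 30 points, and a reliability of .88 yields about 35 points.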

The primary evidence for SAT validity is 
criterion-related, in this case, the ability to predict 
first-year college grades. Donlon (1984, chap. VIII)
reports a wealth of information on this point; we 
can only summarize trends here. In 685 studies, the 
combined SAT Verbal and Math scores correlated 
.42, on average, with college first-year grade point 
average. Interestingly, high school record (e.g., 
rank or grade point average) fares better than the 
SAT in predicting college grades (r = .48). But the 
combination of SAT and high school record proves 
even more predictive; these variables correlated 
.55, on average, with college first-year grade point 
average. Of course, these findings reflect a sub- 
stantial restriction of range: low SAT-scoring high 
school students tend not to attend college. Donlon 
(1984) estimates that the real correlation without 
restriction of range (SAT + high school record) 
would be in the neighborhood of .65. 
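
The size of such a correction can be illustrated with the standard univariate formula for direct range restriction (often called Thorndike's Case II). The text does not describe Donlon's actual procedure, so this Python sketch is only an illustration. It treats the SAT-plus-high-school-record composite as a single predictor, and the standard deviation ratio of 1.3 is a hypothetical value.

```python
import math

def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the unrestricted correlation from a range-restricted one.

    sd_ratio is the unrestricted SD of the predictor divided by its
    restricted SD (greater than 1 when the sample is restricted).
    """
    r, u = r_restricted, sd_ratio
    return (r * u) / math.sqrt(1.0 - r**2 + (r**2) * (u**2))

# Observed correlation from the text; the SD ratio of 1.3 is hypothetical.
corrected = correct_for_range_restriction(0.55, 1.3)
print(round(corrected, 2))
```

Under this assumed ratio the observed correlation of .55 rises to roughly .65. With a ratio of 1.0 (no restriction) the formula returns the observed value unchanged.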

One issue of great practical concern is the effect 
of special study and coaching on SAT scores. Does 
it help to receive special coaching on vocabulary 
and mathematics or to read the numerous preparation
guides available in almost any bookstore? Messick
and Jungeblut (1981) reviewed the available
studies that employed an experimental versus con- 
trol group format. They concluded that coaching 
boosts the combined Verbal and Math scores about 
28 to 30 points, not a substantial increase compared 
to no coaching/preparation. However, for highly 
motivated students who seek out coaching and 
receive a rigorous, structured program, coaching 
effects are much larger, 45 to 110 points on the 
combined Verbal and Math scores (Johnson, 1994). 
A related issue pertains to the sizeable proportion of 
students who take the SAT more than once. Do 
scores tend to rise with repeated testing? In cases in 
which the retesting occurs within five to eight 
months, the average increase is about 12 points each 
for the Verbal and Mathematics scores (Donlon, 
1984). The increase reflects, in part, familiarity with 
the test, but a factor overlooked by many is the ac- 
tive learning that might take place in the interim. 


The American College Test (ACT) 


The American College Test (ACT) assessment program
is a recent program of testing and reporting
designed for college-bound students. In addition to
traditional test scores, the ACT assessment pro- 
gram includes a brief 90-item interest inventory 
(based on Holland’s typology) and a student profile 
section (in which the student may list subjects stud- 
ied, notable accomplishments, work experience, 
and community service). We will not discuss these 
ancillary measures here, except to note that they are 
useful in generating the Student Profile Report, 
which is sent to the examinee and the colleges 
listed on the registration folder. 

Initiated in 1959, the ACT is based on the philos- 
ophy that direct tests of the skills needed in college 
courses provide the most efficient basis for predict- 
ing college performance. In terms of the number of 
students who take it, the ACT occupies second place 
behind the SAT as a college admissions test. The 
four ACT tests require knowledge of a subject area, 
but emphasize the use of that knowledge: 


English (75 questions, 45 minutes). The exami- 
nee is presented with several prose passages ex- 
cerpted from published writings. Certain portions 
of the text are underlined and numbered, and pos- 
sible revisions for the underlined sections are pre- 
sented; in addition, “no change” is one choice. 
The examinee must choose the best option.

Mathematics (60 questions, 60 minutes). Here
the examinee is asked to solve the kinds of math- 
ematics problems likely to be encountered in 
basic college mathematics courses. The test em- 
phasizes concepts rather than formulas and uses 
a multiple-choice format. 

Reading (40 questions, 35 minutes). This subtest 
is designed to assess the examinee’s level of 
reading comprehension; subscores are reported 
for social studies/sciences and arts/literature 
reading skills. 

Science Reasoning (40 questions, 35 minutes). 
This test assesses the ability to read and under- 
stand material in the natural sciences. The ques- 
tions are drawn from data representations, 
research summaries, and conflicting viewpoints. 


In addition to the area scores listed previously, 
ACT results are also reported as an overall Com- 
posite score, which is the average of the four tests. 
ACT scores are reported on a standard score 36-point
scale. In 2002, the average ACT Composite
score of high school graduates was 20.8, with a 
standard deviation of about 5 points (Maxey, 1994). 
However, like the SAT, scores on the ACT are not 
fixed in any given year. ACT scores showed the 
same decline in the 1960s and 1970s as observed on 
the SAT. 
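
As a simple illustration in Python, the Composite can be computed as the mean of the four test scores and then located relative to the 2002 average. The examinee's scores and the function name act_composite are hypothetical. Any rounding applied in operational score reports is not described in the text and is omitted here.

```python
def act_composite(english, mathematics, reading, science):
    """Composite score as the plain average of the four ACT tests."""
    return (english + mathematics + reading + science) / 4

composite = act_composite(24, 22, 26, 20)   # hypothetical examinee
z = (composite - 20.8) / 5.0                # distance from the 2002 mean in SD units
print(composite, round(z, 2))
```

This hypothetical examinee earns a Composite of 23.0, a little less than half a standard deviation above the 2002 mean.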

Critics of the ACT program have pointed to the 
heavy emphasis upon reading comprehension that 
saturates all four tests. The average intercorrela- 
tion of the tests is typically around .60. These data 
suggest that a general achievement/ability factor 
pervades all four tests; results for any one test 
should not be overinterpreted. Fortunately, college 
admission officers probably place the greatest em- 
phasis upon the Composite score, which is the av- 
erage of the four separate tests. The ACT test 
appears to measure much the same thing as the 
SAT; the correlation between these two tests ap- 
proaches .90. It is not surprising, then, that the pre- 
dictive validity of the ACT Composite score rivals 
the SAT combined score, with correlations in the 
vicinity of .40 to .50 with college first-year grade
point average. The predictive validity coefficients 
are virtually identical for advantaged and disad- 
vantaged students, indicating that the ACT tests are 
not biased. 

Kifer (1985) does not question the technical ad- 
equacy of the ACT and similar testing programs, 
but does protest the enormous symbolic power 
these tests have accrued. The heavy emphasis upon 
test scores for college admissions is not a technical 
issue, but a social, moral, and political concern: 


Selective admissions means simply that an institu- 
tion cannot or will not admit each person who 
completes an application. Choices of who will or 
will not be admitted should be, first of all, a matter 
of what the institution believes is desirable and 
may or may not include the use of prediction equa- 
tions. It is just as defensible to select on talent 
broadly construed as it is to use test scores however 
high. There are talented students in many areas— 
leaders, organizers, doers, musicians, athletes, sci- 
ence award winners, opera buffs—who may have 
moderate or low ACT scores but whose presence 
on a campus would change it. 


The reader may wish to review Topic 7B, Test Bias 
and Other Controversies, for further discussion of 
this point. 


POSTGRADUATE SELECTION TESTS


Graduate and professional programs also rely heav- 
ily upon aptitude tests for admission decisions. Of 
course, many other factors are considered when se- 
lecting students for advanced training, but there is 
no denying the centrality of aptitude test results in 
the selection decision. For example, Figure 8.5 de- 
picts a fairly typical quantitative weighting system 
used in evaluating applicants for graduate training 
in psychology. The reader will notice that an over- 
all score on the Graduate Record Exam (GRE) re- 
ceives the single highest weighting in the selection 
process. We review the GRE in the following sec- 
tions, as well as admission tests used by medical 
schools and law schools. 


Graduate Record Exam (GRE) 


The GRE is a multiple-choice and essay test widely 
used by graduate programs in many fields as one 
component in the selection of candidates for ad- 
vanced training. The GRE offers subject examina- 
tions in many fields (e.g., Biology, Computer 
Science, History, Mathematics, Political Science, 
Psychology), but the heart of the test is the general 
test designed to measure verbal, quantitative, and 
analytical writing aptitudes. The verbal section 
(GRE-V) includes verbal items such as analogies, 
sentence completion, antonyms, and reading com- 
prehension. The quantitative section (GRE-Q) con- 
sists of problems in algebra, geometry, reasoning, 
and the interpretation of data, graphs, and dia- 
grams. The analytical writing section (GRE-AW) 
was added in October 2002 as a measure of higher- 
level critical thinking and analytical writing skills. 
It consists of two writing tasks: a 45-minute essay 
in which the applicant takes a position on an issue, 
and a 30-minute essay in which the applicant ana- 
lyzes an argument. This new addition to the GRE 
replaced a multiple-choice test of analytical think- 
ing that is no longer used. 

[FIGURE 8.5. Quantitative weighting system used by graduate program admission committees in psychology. Factors shown: GRE scores (GRE-V + GRE-Q total); undergraduate GPA; psychology GPA; background in statistics/experimental; background in math/computer science; research experience; positive interpersonal skills.]


The first two scores (GRE-V and GRE-Q) are 
reported as standard scores with an approximate 
mean of 500 and standard deviation of 100. Actu- 
ally, the mean score may differ from year to year 
because all test results are anchored to a standard 
reference group of 2,095 college seniors tested in 
1952 on the verbal and quantitative portions of the 
test. Historically, graduate programs have tended to 
pay attention to the combination of scores on the 
first two parts (GRE-V + GRE-Q), where com- 
bined scores above 1,000 would be considered 
above average. Recently, graduate programs have 
paid more attention to writing skills in their appli- 
cants, which explains the addition of the analytical 
writing section (GRE-AW) to the test. 

Scoring of the analytical writing section is based 
on 6-point holistic ratings provided independently 
by two trained raters. If the two scores differ by 
more than one point on the scale, the discrepancy is 
adjudicated by a third GRE-AW reader. According 
to the GRE Board (www.gre.org), the GRE-AW test 
reveals smaller ethnic group differences than found 
in the multiple-choice sections. For example, the 
differences between African American and Caucasian
examinees and between Hispanic and Caucasian
examinees are smaller on the GRE-AW than
on the GRE-V or GRE-Q. This suggests that the
new test does not unduly penalize ethnic groups traditionally
underrepresented in graduate programs.

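
The two-rater scoring rule described at the start of the preceding paragraph can be sketched in Python. The text states only that discrepancies of more than one point are adjudicated by a third reader. Averaging the two ratings when they agree, and using the third reader's rating as the adjudicated score, are both assumptions made for this illustration, as is the function name gre_aw_score.

```python
def gre_aw_score(rater1, rater2, third_reader=None):
    """Combine two holistic ratings, calling on a third reader for large gaps."""
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2      # assumed combination rule
    if third_reader is None:
        raise ValueError("ratings differ by more than one point; third reader needed")
    return float(third_reader)            # assumed adjudication outcome

print(gre_aw_score(4, 5))                  # ratings within one point
print(gre_aw_score(3, 5, third_reader=4))  # discrepancy sent to a third reader
```
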
The reliability of the GRE is strong, with inter- 
nal consistency reliability coefficients typically 
around .90 for the three components. The validity of 
the GRE commonly has been examined in relation 
to the ability of the test to predict performance in 
graduate school. Performance has been operational- 
ized mainly as grade point average, although faculty 
ratings of student aptitude also have been used. For 
example, based upon a meta-analytic review of 22 
studies with a total of 5,186 students, Morrison and 
Morrison (1995) concluded that GRE-V correlated 
.28 and GRE-Q correlated .22 with graduate grade 
point average. Thus, on average, GRE scores ac- 
counted for only 6.3 percent of the variance in grad- 
uate-level academic performance. In a recent study 
of 170 graduate students in psychology at Yale Uni- 
versity, Sternberg and Williams (1997) also found 
minimal correlations between GRE scores and grad- 
uate grades. When GRE scores were correlated with 
faculty ratings on five variables (analytical, creative,
practical, research, and teaching abilities), the cor- 
relations were even lower, for the most part hover- 
ing right around zero. The single exception was the 
GRE analytical thinking score, which correlated 
modestly with almost all of the faculty ratings. 
However, this correlation was observed only for 
men (on the order of r = .3), whereas for women it 
was almost exactly zero in every case! Based upon 
these and similar studies, the consensus would ap- 
pear to be that excessive reliance on the GRE for 
graduate school selection may overlook a talented 
pool of promising graduate students. 
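
The figure of 6.3 percent cited above follows from squaring a correlation to obtain the proportion of variance accounted for. One way to reproduce it, sketched in Python, is to average the squared correlations for GRE-V and GRE-Q. Whether the original figure was derived in exactly this way is an assumption.

```python
# Correlations with graduate grade point average reported in the text.
r_verbal, r_quant = 0.28, 0.22

# Squaring a correlation gives the proportion of criterion variance explained.
variance_explained = [r ** 2 for r in (r_verbal, r_quant)]
mean_pct = 100 * sum(variance_explained) / len(variance_explained)
print(round(mean_pct, 1))
```
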

However, other researchers are more support- 
ive in their evaluation of the GRE, noting that the 
correlation of GRE scores and graduate grades is 
not a good index of validity because of the restric- 
tion of range problem (Kuncel, Campbell, & Ones, 
1998). Specifically, applicants with low GRE 
scores are unlikely to be accepted for graduate 
training in the first place, and thus relatively little 
information is available with respect to whether 
low scores predict poor academic performance. Put 
simply, the correlation of GRE scores with gradu- 
ate academic performance is based mainly upon 
persons with middle to high levels of GRE scores, 
that is, GRE-V + GRE-Q totals of 1,000 and up. As 
such, the correlation will be attenuated precisely 
because those with low GREs are not included in 
the sample. Another problem with validating the 
GRE against grades in graduate school is the unre- 
liability of the criterion (grades). Based upon the 
expectation that graduate students will perform at 
high levels, some professors may give blanket A’s 
such that grades do not reflect real differences in 
student aptitudes. This would lower the correlation 
between the predictor (GRE scores) and the cri- 
terion (graduate grades). When these factors are 
accounted for, many researchers find reason to be- 
lieve the GRE is still a valid tool for graduate 
school selection (Melchert, 1998; Ruscio, 1998). 


Medical College Admission Test (MCAT) 


The MCAT is required of applicants to almost all 
medical schools in the United States. The test is
designed to assess achievement of the basic skills
and concepts that are prerequisites for successful 
completion of medical school. There are three 
multiple-choice sections (Verbal Reasoning, Physi- 
cal Sciences, Biological Sciences) and one essay 
section (Writing Sample). The Verbal Reasoning 
section is designed to evaluate the ability to under- 
stand and apply information and arguments pre- 
sented in written form. Specifically, the test consists 
of several passages of about 500 to 600 words each, 
taken from humanities, social sciences, and natural 
sciences. Each passage is followed by several ques- 
tions based on information included in the passage. 
The Physical Sciences section is designed to evalu- 
ate reasoning in general chemistry and physics. The 
Biological Sciences section is designed to evaluate reasoning
in biology and organic chemistry. These physical
and biological science sections contain 10 to 11
problem sets described in about 250 words each, 
with several questions following. 

The Writing Sample Test consists of two 30- 
minute essays. This test is designed to assess basic 
writing skills such as developing a central idea, 
synthesizing concepts and ideas, writing logically, 
and following accepted practices of grammar, syn- 
tax, and punctuation. The writing sample essays 
begin with a prompt, which consists of a topic state- 
ment (printed in boldface) followed by instructions 
for interpretation and response. The writing sample 
prompts resemble the following (www.aamc.org): 


Scientists should seek to confirm theories or 
hypotheses rather than to refute them. 


Describe a specific situation in which a scien- 
tist might seek to refute a theory or hypothesis 
rather than to confirm it. Discuss what you 
think determines when scientists should seek to 
confirm theories or hypotheses and when they 
should seek to refute them. 


The writing samples are scored on a 6-point scale 
by independent raters. The basis for the inclusion 
of writing samples in the MCAT is that physicians 
are expected to communicate clearly with patients, 
write lucid and effective medical notes, and con- 
tribute persuasively to local and national debates 
about health care policy. 

Each of the MCAT scores (except Writing Samples) 
is reported on a scale from 1 to 15 (means of 
about 8.0 and standard deviations of about 2.5). The 
reliability of the test is lower than that of other ap- 
titude tests used for selection, with internal consis- 
tency and split-half coefficients mainly in the low 
.80s (Gregory, 1994a). MCAT scores are mildly pre- 
dictive of success in medical school, but once again 
the restriction of range conundrum (previously dis- 
cussed in relation to the GRE) is at play. In particu- 
lar, examinees with low MCAT scores who would 
presumably confirm the validity of the test by per- 
forming poorly in medical school are rarely admit- 
ted, which reduces the apparent validity of the test. 
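
The restriction of range effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the scores are standardized, the true test-criterion correlation of .50 and the admission cutoff of half a standard deviation above the mean are hypothetical values chosen for illustration, and none of the numbers come from actual MCAT data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_restriction(true_r=0.5, cutoff_z=0.5, n=20000, seed=1):
    """Return (r among all applicants, r among admitted applicants only).

    true_r and cutoff_z are hypothetical illustration values.
    """
    rng = random.Random(seed)
    # Standardized admission test scores for every applicant.
    test = [rng.gauss(0, 1) for _ in range(n)]
    # Later grades share variance with the test so that r is about true_r.
    noise_sd = math.sqrt(1 - true_r ** 2)
    grades = [true_r * t + rng.gauss(0, noise_sd) for t in test]
    full_r = pearson_r(test, grades)
    # Only applicants at or above the cutoff are admitted and earn grades.
    admitted = [(t, g) for t, g in zip(test, grades) if t >= cutoff_z]
    restricted_r = pearson_r([t for t, _ in admitted],
                             [g for _, g in admitted])
    return full_r, restricted_r


full_r, restricted_r = simulate_restriction()
print(f"all applicants: r = {full_r:.2f}")
print(f"admitted only:  r = {restricted_r:.2f}")
```

In this sketch the correlation computed only on admitted examinees comes out noticeably smaller than the correlation for the full applicant pool, even though the underlying relationship is the same for everyone.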


Law School Admission Test (LSAT) 


The LSAT is a half-day standardized test required 
of applicants to virtually every law school in the 
United States. The test is designed to measure skills 
considered essential for success in law school, including 
the reading and understanding of complex 
material, the organization and management of information, 
and the ability to reason critically and 
draw correct inferences. The LSAT consists of mul- 
tiple-choice questions in four areas: reading com- 
prehension, analytical reasoning, and two logical 
reasoning sections. An additional section is used to 
pretest new test items and to preequate new test 
forms, but this section does not contribute to the 
LSAT score. The score scale for the LSAT extends 
from a low of 120 to a high of 180. In addition to 
the objective portions, a 30-minute writing sample 
is administered at the end of the test. The section is 
not scored, but copies of the writing sample are sent 
to all law schools to which the examinee applies. 

The LSAT has acceptable reliability (internal 
consistency coefficients in the .90s) and is regarded 
as a moderately valid predictor of law school grades. 
Yet, in one fascinating study, LSAT scores correlated 
more strongly with state bar test results than with law 
school grades (Melton, 1985). This speaks well for 
the validity of the test, insofar as it links LSAT scores 
with an important, real-world criterion. 


SUMMARY 


1. Aptitude and ability tests differ in focus and 
use. In general, ability is a broad concept whereas 
aptitude typically refers to homogeneous segments 
of ability. Also, ability tests attempt to gauge cur- 
rent functioning, whereas aptitude tests are typically 
used to help predict future performance. 


2. Aptitude tests owe their origin to factor 
analysis, a family of procedures employed to sum- 
marize relationships among variables that are cor- 
related in highly complex ways. For example, 
factor analysis might help a researcher discover that 
a battery of 24 ability tests represents only four un- 
derlying variables, called factors. 


3. The beginning point for every factor an- 
alysis is the correlation matrix, a complete table of 
intercorrelations among all the variables. The vari- 
ables in a factor analysis can include results for any 
more-or-less continuous dimension, such as test 
scores, social class, and behavior ratings. 


4. The factor matrix consists of a table of fac- 
tor loadings showing the weighting of each variable 
on each factor. A factor is a weighted linear sum of 
the variables. The factor loading for each variable 
is a correlation coefficient between the factor and 
that variable. 


5. Factors can be represented as geometric ref- 
erence axes, and the loadings for each variable on 
each factor can be plotted within this space. This al- 
lows the researcher to visualize the location of each 
variable on the two or three most important factors. 


6. Because the position of the reference axes 
is arbitrary, the researcher is free to rotate the axes 
so that they produce a more sensible fit with the 
factor loadings for the variables. A number of dif- 
ferent methods of rotation exist (e.g., rotation to 
positive manifold, rotation to simple structure). 


7. The naming of factors requires judgment 
and inference. In particular, the researcher must 
attempt to determine the common processes and 
abilities shared by the tests or variables with strong 
loadings on a factor. Also, tests or variables with 
low loadings may help sharpen the definition and 
naming of a factor. 


8. In order for a particular kind of factor to 
emerge from an analysis, some of the tests and mea- 
sures must contain that factor in the first place. Large 
samples, in excess of 200 persons, are preferred. The 
choice of strategies for rotation is important: Or- 
thogonal axes assume that factors are uncorrelated; 
oblique axes accept that factors are correlated. 


9. Used for educational and vocational guid- 
ance in the high school years, the Differential Apti- 
tude Test (DAT) consists of eight independent tests. 
A lack of discriminant validity between the eight 
tests is one concern with this respected battery. 


10. The General Aptitude Test Battery (GATB) 
is typical of the multiple aptitude test batteries 
based upon factor analysis. The GATB consists of 
eight paper-and-pencil tests and four apparatus 
measures derived from a factor analysis of 59 tests. 
The test yields nine factor scores helpful in the pre- 
diction of job performance. 


11. The Armed Services Vocational Aptitude 
Battery (ASVAB), used by the Armed Services to 
screen and assign recruits, is probably the most 
widely used aptitude test in existence. The test 
yields composite scores, each based upon two to 
four subtest scores. A limitation of the ASVAB is 
that the composites are highly correlated with one 
another. 


12. The prediction of academic performance is 
facilitated by such tests as the Scholastic Assess- 
ment Tests (SAT), formerly the Scholastic Aptitude 
Tests, and the American College Test (ACT) as- 
sessment program. 


13. The SAT yields separate Verbal and Math- 
ematics scores anchored to a 1941 mean of 500 (SD 
of 100). The ACT yields four scores (in English, 
mathematics, reading, and science reasoning) re- 
ported on a 36-point standard score scale (mean of 
about 20.7, SD of 5). 


14. Used for admission to many graduate pro- 
grams, the GRE consists of three general tests (Ver- 
bal, Quantitative, and Analytical Writing) as well 
as subject tests in specific areas (e.g., Biology, 
Computer Science, Psychology). The Verbal and 
Quantitative exams are normed to a mean of about 
500 and SD of 100. The Analytical Writing exam 
is scored holistically by trained raters on a 6-point 
scale. 


15. The MCAT is required of applicants to al- 
most all medical schools in the United States. There 
are three multiple-choice sections (Verbal Reason- 
ing, Physical Sciences, Biological Sciences) and 
one essay section (Writing Sample), all designed to 
assess basic skills needed for medical school. 


16. The LSAT is a half-day standardized test 
required of applicants to virtually every law school 
in the United States. The LSAT consists of multi- 
ple-choice questions in four areas: reading com- 
prehension, analytical reasoning, and two logical 
reasoning sections. 


Essential Concepts in Achievement Tests 


Educational Achievement Tests 


Cheating: The Dark Side of Achievement Testing 


Summary 


Key Terms and Concepts 


In this topic, we continue the discussion of group 
tests by surveying their use within educational 
systems. Beginning in the elementary grades, 
school districts use group achievement tests to track 
the progress of individual students and to gauge the 
success of educational programs. The ubiquitous 
practice of group achievement testing within U.S. 
schools is largely a positive affair because it pro- 
vides an objective basis for evaluation. However, 
there is on occasion a dark side as well, insofar as 
testing can become the tail that wags the dog. The 
negative impact falls into three general categories. 
First, teachers may teach to the tests rather than try- 
ing to impart genuine knowledge. Second, in a quest 
to obtain high scores for their school systems, ad- 
ministrators may foster an environment that en- 
courages liberal, nonstandard testing. Worse yet, 
school personnel may engage in outright fraud such 
as “correcting” answer sheets. The third conse- 
quence is that individual examinees will find inge- 
nious ways of cheating on nationally normed tests. 
We review a few of these disquieting trends at the 
end of this topic. 


ESSENTIAL CONCEPTS 
IN ACHIEVEMENT TESTS 


Achievement tests, known as attainment tests in the 
United Kingdom, are the most widely used of all 
types of tests. Although precise figures on usage 
do not exist, virtually every school-aged child in 
the United States encounters group standardized 
achievement testing on a yearly or biyearly basis. 
One estimate is that public schools administer an 
average of two and one-half tests per student per 
year (Medina & Neill, 1990). Beyond a doubt, the 
number of achievement tests administered sur- 
passes all other forms of psychological and educa- 
tional testing. 

Achievement tests are designed to measure the 
attainment of skills taught within schools or training 
programs. These tests can be quite narrowly defined 
such as a test of punctuation skills, or more broadly 
conceived such as a test of reading comprehension. 
Even though achievement tests differ in their speci- 
ficity, they all serve a related function: to measure 
current skill level in a well-defined domain. 

As catalogued in the Mental Measurements 
Yearbook series (e.g., Plake & Impara, 2001), literally 
hundreds of achievement tests have been published. 
It is not feasible to survey the vast panorama 
of these instruments. Instead, we review represen- 
tative achievement tests and focus upon the issues 
raised by their use. We begin with a primer on es- 
sential concepts in achievement testing. 


Group and Individual Achievement Tests 


A fundamental distinction is drawn between group 
achievement tests and individual achievement tests. 
Group achievement tests are used mainly in the 
classroom, whereas individual achievement tests are 
employed one on one in clinical or educational settings. 
Group achievement tests might also be called 
educational achievement tests, since these instruments 
are commonly administered to entire 
school systems at the behest of state school superintendents 
or other administrators. Of course, group 
tests are given to dozens or hundreds of students 
at the same time, with all the advantages 
and pitfalls attendant to this approach (see 
Topic 2B, The Testing Process). 

Individual achievement tests play an essential 
role in the diagnosis of a learning disability (LD). 
Not only do these tests provide documentation of im- 
paired performance in such crucial academic areas 
as reading, writing, and calculation, some achieve- 
ment tests can help identify the particular skill 
deficits that underlie learning disabilities. Individual 
achievement tests are used in conjunction with other 
instruments, especially intelligence tests, as dis- 
cussed in Topic 10A, School-Based Assessment. 


Norm-Referenced and 
Criterion-Referenced Tests 


In addition to the fundamental dichotomy that sep- 
arates group from individual achievement tests, 
another important distinction is between norm- 
referenced and criterion-referenced achievement 
tests. The reader will recall from Topic 2A (The 
Nature and Uses of Psychological Tests) that 
norm-referenced tests allow for interpretation in 
reference to a large standardization sample. Norm- 
referenced tests facilitate the reporting of scores as 
percentile ranks, standard scores, and the like. In 
contrast, criterion-referenced tests allow for inter- 
pretation in reference to the specific content mas- 
tered by the individual examinee. For example, a 
criterion-referenced test might determine that an 
examinee knows how to spell correctly 94 out of 
100 items from a designated list of essential words. 
Of course, these two approaches are not necessar- 
ily incompatible. In fact, most major achievement 
test batteries provide both norm-referenced and cri- 
terion-referenced interpretations. 
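
The two kinds of interpretation can be contrasted in a short sketch. The spelling result (94 of 100 words) is the example from the text; the norm sample of ten raw scores is hypothetical and far smaller than any real standardization sample, and the midpoint convention for tied scores is one common choice rather than the method of any particular test publisher.

```python
from bisect import bisect_left, bisect_right


def percent_mastered(items_correct, items_total):
    """Criterion-referenced view: share of the content domain mastered."""
    return 100.0 * items_correct / items_total


def percentile_rank(raw_score, norm_scores):
    """Norm-referenced view: standing relative to a norm sample.

    Counts the scores below raw_score plus half of the tied scores.
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)
    tied = bisect_right(ordered, raw_score) - below
    return 100.0 * (below + 0.5 * tied) / len(ordered)


# Hypothetical norm sample of raw spelling scores (illustration only).
norm_sample = [72, 75, 80, 81, 85, 88, 90, 92, 94, 97]

mastery = percent_mastered(94, 100)
rank = percentile_rank(94, norm_sample)
print(f"criterion-referenced: {mastery:.0f} percent of the word list mastered")
print(f"norm-referenced: percentile rank of {rank:.0f}")
```

The same raw score therefore supports both statements at once: how much of the designated content the examinee has mastered, and how the examinee compares with others.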


Ability, Aptitude, and Achievement Tests 


The distinction between ability, aptitude, and 
achievement tests merits brief review in this context. 
Ability tests sample a broad assortment of skills in 
order to estimate general intellectual level. In con- 
trast, aptitude tests usually measure homogeneous 
segments of ability and are often used to predict fu- 
ture performance. The exceptions here include mul- 
tiple aptitude test batteries that sample abilities 
broadly; these instruments are very similar to abil- 
ity tests. Finally, as noted, achievement tests mea- 
sure current skill attainment, particularly in relation 
to school and training programs. 

In the real world, the distinction between these 
three types of tests is often quite fuzzy (Gregory, 
1994a). It has been known for some time that the 
correlation between an achievement test and an abil- 
ity test may be nearly as high as that between any 
two ability tests. In many cases, achievement and 
ability tests tap similar underlying cognitive factors. 
However, the assumptions that underlie these two 
forms of testing differ widely. Achievement tests 
are generally designed to measure the effects of 
relatively standardized educational experiences, 
whereas aptitude tests typically make fewer as- 
sumptions about specific prior learning experiences. 

The applications of aptitude and achievement 
tests also differ widely. Aptitude tests are designed 
primarily to predict future performance in schools or 
training programs. For example, a scholastic apti- 
tude test might be used to predict future academic 
performance in college; a clerical aptitude test might 
be used to predict future performance in the role of 
secretary. In contrast, achievement tests are used to 
gauge a student’s current level of attainment in a 
given subject matter. In other words, aptitude tests 
are oriented to the future, whereas achievement tests 
are oriented to the present. The assessment of cur- 
rent skill level with achievement tests can serve sev- 
eral purposes, as outlined in the following section. 


The Functions of Achievement Testing 


Achievement tests permit a wide range of potential 
uses. Practical applications of individual and group 
achievement tests include the following: 


• To identify children and adults with specific 
achievement deficits who might need more detailed 
assessment for learning disabilities 

• To help parents recognize the academic strengths 
and weaknesses of their children and thereby foster 
individual remedial efforts at home 

• To identify classwide or schoolwide achievement 
deficiencies as a basis for redirection of instructional 
efforts 

• To appraise the success of educational programs 
by measuring the subsequent skill attainment of 
students 

• To group students according to similar skill level 
in specific academic domains 

• To identify the level of instruction that is appropriate 
for individual students 


Thus, achievement tests serve institutional goals 
such as monitoring schoolwide achievement levels, 
but also play an important role in the assessment of 
individual learning difficulties. As previously noted, 
different kinds of achievement tests are used to pur- 
sue these two fundamental applications (institu- 
tional and individual). Institutional goals are best 
served by group achievement test batteries, whereas 
individual assessment is commonly pursued with 
individual achievement tests (even though group 
tests may play a role here, too). In this topic we 
focus on group educational achievement tests. 


EDUCATIONAL ACHIEVEMENT 
TESTS 


Virtually every school system in the nation uses at 
least one educational achievement test, so it is not 
surprising that test publishers have responded to the 
widespread need by developing a panoply of ex- 
cellent instruments. In the following section, we 
describe several of the most widely used group 
standardized achievement tests. The tests to be de- 
scribed share several characteristics in common. 
First, these instruments are multilevel batteries 
that contain comparable subtests for students in 
the different grades of primary and/or secondary 
school. Some of the batteries span kindergarten 
(K) through grade 12, whereas others are designed 
for elementary grades (K through 8) or secondary 
grades (9 through 12) only. In a multilevel battery, 
test booklets contain overlapping sections, and stu- 


dents at different grade levels enter and exit the test 
materials at grade-appropriate positions. 

A second feature common to many educational 
test batteries is concurrent norming with an ability 
test. For example, the achievement battery known 
as the Sequential Tests of Educational Progress 
(STEP-III) is concurrently normed with the ability 
battery known as the School and College Ability 
Test (SCAT-III). Tests that are concurrently normed 
share the same standardization sample. As a result, 
average performance on one test can be directly 
equated with average performance on the other. 
Concurrent norming is helpful because it allows 
parents, teachers, and counselors to make precise, 
direct, and meaningful comparisons between 
achievement and ability. After all, the implications 
of an achievement score are moderated by knowl- 
edge of the student’s ability. A student with high 
ability scores but low achievement scores might be 
a good candidate for educational intervention, in- 
cluding a more detailed assessment for learning 
disability (as discussed in Topic 10A, School- 
Based Assessment). In contrast, a student with low 
ability scores and low achievement scores might be 
working at full potential; specialized interventions 
may not be warranted. 
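
The comparison that concurrent norming makes possible can be sketched as follows. Because both tests share one standardization sample, each score can be expressed as a z score relative to the same group and the two z scores compared directly. The norm values, the student scores, and the flagging threshold of one standard deviation are all hypothetical; actual decisions about learning disabilities rest on far more than a simple difference score.

```python
from statistics import mean, pstdev

# Hypothetical scores of the SAME standardization sample on both tests.
ability_norms = [85, 90, 95, 100, 105, 110, 115]
achievement_norms = [40, 45, 50, 55, 60, 65, 70]


def z_score(score, sample):
    """Express a score in standard deviation units of the norm sample."""
    return (score - mean(sample)) / pstdev(sample)


def compare_to_ability(ability, achievement, gap=1.0):
    """Flag achievement that trails ability by `gap` SDs or more.

    The default gap of 1.0 is an illustrative assumption, not a standard.
    """
    z_ability = z_score(ability, ability_norms)
    z_achievement = z_score(achievement, achievement_norms)
    if z_ability - z_achievement >= gap:
        return "achievement well below ability: consider further assessment"
    return "achievement roughly in line with ability"


# High ability, low achievement: a candidate for closer assessment.
print(compare_to_ability(ability=115, achievement=45))
# Low ability, low achievement: possibly working at full potential.
print(compare_to_ability(ability=88, achievement=44))
```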

The third commonality in group achievement 
tests is that they measure similar educational skills. 
Educational achievement tests tend to emphasize 
these skill areas: 


• Reading, including comprehension and vocabulary 

• Written language, including spelling, punctuation, 
and capitalization 

• Mathematics, including computation and application 


In addition, tests at the elementary grade levels 
often assess listening skills, including oral com- 
prehension. Some test batteries also assess knowl- 
edge of basic concepts in science, social studies, 
and humanities. 

Finally, the educational achievement tests dis- 
cussed here possess generally excellent psychometric 
characteristics. Test contents are relevant and appro- 
priate, that is, the instruments show good content 
validity; subscales possess excellent internal and 


alternate-forms reliability; standardization samples 
are invariably large and representative; and overt gen- 
der and race bias are nonexistent. The psychometric 
quality of the widely used educational achievement 
tests is typically respectable, if not extraordinary. 

We survey several widely used tests of educa- 
tional achievement subsequently. The reader will 
discover that a detailed analysis of psychometric 
properties—reliability, validity, norming, and the 
like—is encountered only for the first instrument 
reviewed, the Iowa Tests of Basic Skills. In gen- 
eral, the psychometric quality of the other tests is 
equally laudable, so for these test batteries we focus 
upon functions, applications, special features, and 
an occasional shortcoming or two. Readers who 
desire more information on these instruments 
should consult reviews in the Mental Measure- 
ments Yearbook (Conoley & Impara, 1995; Cono- 
ley & Kramer, 1989, 1992; Impara & Plake, 1998; 
Plake & Impara, 2001; Mitchell, 1985). 


Iowa Tests of Basic Skills (ITBS) 


First published in 1935, the Iowa Tests of Basic 
Skills (ITBS) were most recently revised and re- 
standardized in 1992. The ITBS is a multilevel bat- 
tery of achievement tests that covers grades K 
through 8. A companion test, the Tests of Achieve- 
ment and Proficiency (TAP), covers grades 9 
through 12. In order to expedite direct and accurate 
comparisons of achievement and ability, the ITBS 
and the TAP were both concurrently normed with 
the Cognitive Abilities Test (CogAT), a respected 
group test of general intellectual ability. 

The ITBS is available in several levels that cor- 
respond roughly with the ages of the potential ex- 
aminees: levels 5 and 6 (grades K-1), levels 7-8 
(grades 2-3), and levels 9-14 (grades 3-8). The 
basic subtests for the older levels measure vocabu- 
lary, reading, language, mathematics, social stud- 
ies, science, and sources of information (e.g., uses 
of maps and diagrams). 

From the first edition onward, the ITBS has 
been guided by a pragmatic philosophy of educa- 
tional measurement. The manual states the purpose 
of testing as follows: 

The purpose of measurement is to provide information 
which can be used in improving instruction. 
Measurement has value to the extent that it results 
in better decisions which directly affect pupils. 


To this end, the ITBS incorporates a criterion-ref- 
erenced skills analysis to supplement the usual 
array of norm-referenced scores. For example, one 
feature available from the publisher’s scoring ser- 
vice is item-level information. This information in- 
dicates topic areas, items sampling the topic, and 
correct or wrong response for each item. Teachers 
therefore have access to a wealth of diagnostic-in- 
structional information for each student. Whether 
this information translates to better instruction—as 
the test authors desire—is very difficult to quantify. 
As Linn (1989) notes, “We must rely mostly on 
logic, anecdotes, and opinions when it comes to an- 
swering such questions.” 

The technical properties of the ITBS are be- 
yond reproach. Internal consistency and equiva- 
lent-form reliability coefficients are mostly in the 
mid-.80s to low .90s. Stability coefficients for a 
one-year interval are almost all in the .70 to .90 
range. The test is free from overt racial and gender 
bias, as determined by content evaluation and item 
bias studies. The year 2000 norms for the test were 
empirically developed from large, representative 
national probability samples. 

Standardization of a previous form in 1988 re- 
vealed an intriguing trend in comparison to results 
for versions that were standardized several years ear- 
lier. The 1988 sample demonstrated higher achieve- 
ment levels, on the order of 1 to 3 months of grade 
equivalent. This pattern of slowly rising test perfor- 
mance emphasizes the need for annual or biannual 
restandardization of achievement test batteries. What 
has happened in the absence of timely restandard- 
ization of major achievement tests is that all 50 states 
can report honestly that they exceed the national av- 
erage on group standardized tests (Cannell, 1988). 
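
The consequence of outdated norms can be shown with a short calculation. Assume that scores are normally distributed in grade-equivalent months and that the old norm group's average sits at zero. The gains of 1 to 3 months come from the text; the standard deviation of 10 months is a hypothetical value chosen only to make the arithmetic concrete.

```python
from statistics import NormalDist


def share_above_old_average(gain, sd):
    """Fraction of current students who score above the OLD norm mean.

    The old norm mean is placed at 0; `gain` is how far the current
    population has shifted upward since the test was normed.
    """
    current = NormalDist(mu=gain, sigma=sd)
    return 1.0 - current.cdf(0.0)


for gain_months in (0, 1, 2, 3):
    share = share_above_old_average(gain_months, sd=10.0)  # sd is assumed
    print(f"gain of {gain_months} months: {share:.1%} above the old average")
```

With no gain, exactly half of the students exceed the average, as expected. Any upward drift pushes that share above one half, which is how group averages can all appear "above the national average" when the norms are stale.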

Item content of the ITBS is judged relevant by 
curriculum experts and reviewers, which speaks to 
the content validity of the test (Lane, 1992; Linn, 
1989; Raju, 1992; Willson, 1989). Although the 
predictive validity of the latest ITBS has not been 
studied extensively, evidence from prior editions is 
very encouraging. For example, ITBS scores correlate 
moderately with high school grades (rs 
around .60). The ITBS is not a perfect instrument, 
but it represents the best that modern test develop- 
ment methods can produce. 


Metropolitan Achievement Test (MAT) 


The Metropolitan Achievement Test dates back to 
1930 when the test was designed to meet the cur- 
riculum assessment needs of New York City. The 
stated purpose of the MAT is “to measure the 
achievement of students in the major skill and con- 
tent areas of the school curriculum.” The MAT is 
concurrently normed with the Otis-Lennon School 
Ability Test (OLSAT). 

Now in its eighth edition, the MAT is a multi- 
level battery designed for grades K through 12 and 
was most recently normed in 2000. The areas tested 
by the MAT include the traditional school-related 
skills: 


Reading 
Mathematics 
Language 
Writing 
Science 
Social Studies 


An attractive feature of the MAT is that student 
reading scores are reported as Lexile measures, a 
new and practical indicator of reading level. Lexile 
measures are likely to become a standard feature in 
most group achievement tests in the years ahead, so 
it is worth a brief detour to explain their nature and 
significance. 


Lexile Measures 


The Lexile approach is a major new improvement 
in the assessment of reading skill. It was developed 
over a span of more than twelve years using millions 
of dollars in grant funds from the National Institute 
of Child Health and Human Development (NICHD) 
(www.lexile.com). The Lexile approach is based 
upon two simple, commonsense assumptions, 
namely (1) reading materials can be placed on a 


continuum as to difficulty level (comprehensibility), 
and (2) readers can be ordered on a continuum as to 
reading ability. The Lexile framework provides a 
common metric for matching readers and text, 
which, in turn, permits parents and educators to 
choose appropriate reading materials for children. 

The Lexile scale is a true interval scale. The 
Lexile measure for a reading selection is a specific 
number indicating the reading demand of the text 
based on the semantic difficulty (vocabulary) and 
syntactic complexity (sentence length). Lexile 
measures for reading selections typically range 
from 200L to 1700L (Lexiles). The Lexile score for 
a student, obtained from the Reading Comprehen- 
sion test of the MAT or other achievement tests, is 
a precise index of the student’s reading ability, cal- 
ibrated on the same scale as the Lexile measure for 
text. The value of the Lexile approach is that stu- 
dent comprehension can be predicted as a function 
of the disparity between the demands of the text 
and the student’s ability. For example, when read- 
ers are well targeted (the difference between text 
and reader is close to 0 Lexiles), research indicates 
that reader comprehension will be about 75 per- 
cent. When the text difficulty exceeds the reader’s 
ability by 250L, comprehension drops to about 50 
percent. When the skill of the reader exceeds the 
demands of the text by 250L, comprehension is 
about 90 percent (www.lexile.com). 
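
The three comprehension figures just cited can be turned into a rough forecasting sketch. The anchor points (50 percent at 250L below the text, 75 percent at a perfect match, 90 percent at 250L above) come from the passage above. Connecting them with straight lines and holding the forecast constant beyond 250L in either direction are simplifying assumptions made here; this is not the Lexile framework's actual forecasting equation.

```python
# (reader minus text in Lexiles, forecast comprehension in percent)
ANCHORS = [(-250.0, 50.0), (0.0, 75.0), (250.0, 90.0)]


def forecast_comprehension(reader_lexile, text_lexile):
    """Interpolate forecast comprehension from the three cited anchors.

    Linear interpolation and clamping outside +/-250L are assumptions.
    """
    gap = reader_lexile - text_lexile
    if gap <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if gap >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= gap <= x1:
            return y0 + (y1 - y0) * (gap - x0) / (x1 - x0)


# A hypothetical 900L text read by three hypothetical students.
for reader in (650, 900, 1150):
    pct = forecast_comprehension(reader, 900)
    print(f"{reader}L reader, 900L text: about {pct:.0f} percent")
```

A teacher could use a function of this kind to screen a reading list against each student's Lexile score and keep only titles whose forecast comprehension falls in a desired band.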

The Lexile approach has a number of potential 
benefits and applications for teachers and parents. 
Teachers can look up Lexile measures for specific 
books (the Lexile corporation has evaluated over 
30,000 titles to date) as a way of building a library 
of titles at varying levels. Also, they can produce 
individualized reading lists suitable for each stu- 
dent. Likewise, parents can select well-matched 
books to read to their children. Stenner (2001) cap- 
tures the allure of the Lexile approach as follows: 


One of the great strengths of the Lexile Framework 
is the way it encourages thought about what fore- 
casted comprehension rate would be optimal for 
different instructional contexts. Harry Potter and 
the Goblet of Fire is a 910L text. Readers at 400L 
to 500L can nonetheless enjoy listening to this 
story read aloud. A 700L reader could read the text 


in a one-on-one tutoring context. A 900L reader 
will disappear for an hour or two, fully capable of 
self-engaging with the text, and a 1600L adult 
reader can become so engrossed that a two-hour 
plane ride flies by. 


The Lexile approach is not a panacea, but it is a major 
improvement in the assessment of reading skill. 


The Iowa Tests of Educational 
Development (ITED) 


The widely used Iowa Tests of Educational Devel- 
opment were first released in 1942, then revised and 
restandardized every few years. The purpose of the 
ITED is: “To assess intellectual skills that are im- 
portant in adult life and provide the basis for con- 
tinued learning.” Unlike many other achievement 
tests which emphasize skills linked to specific cur- 
ricular goals, the intention of the ITED is to measure 
the fundamental goals or generalized skills of edu- 
cation that are independent of the curriculum. For 
this reason, the ITED items emphasize higher-order 
thinking skills. Rather than testing isolated bits of 
knowledge, questions on the ITED feature problems 
which require the synthesis of knowledge or a 
multiple-step solution (Figure 8.6). 

The ITED is designed for high school students 
in grades 9 through 12. The test yields nine basic 
scores plus a composite: 


Vocabulary 

Reading Comprehension 

Language: Revising Written Materials 
Spelling 

Mathematics: Concepts and Problem Solving 
Computation 

Analysis of Social Studies Materials 
Analysis of Science Materials 

Sources of Information 

Total Battery Score 


The core battery consists of the first six tests listed. 
The ITED is anchored to previous editions so that 
regardless of form, level, or edition, a given score 
represents the same level of accomplishment. The 
test was renormed in 2000 with a large national 
sample of high school students. 


Social Studies 

Advertisement 

Four out of five doctors surveyed favored 

BALM SOAP 

Tests show that Balm Soap clears up complexion problems 
faster than any other product! 

On the basis of this advertisement, which of the following conclusions, if any, is valid? 

A. It has been scientifically demonstrated that the quickest way to get rid of any 
complexion problem is to use Balm Soap. 
B. Of the five leading brands of complexion soaps, only one is better than Balm Soap 
from a medical point of view. 
C. Of all the doctors who recommended skin care products, four out of five 
recommended Balm Soap. 
D. None of these conclusions is valid. 


Natural Sciences 

Soon after being bitten by a mosquito, a person became ill with yellow fever. Which 
conclusion, if any, is justified solely from these observations? 

A. There is insufficient evidence to draw any of the conclusions that follow. 
B. Mosquitoes are the direct cause of yellow fever. 
C. The mosquito introduced a microorganism into the person’s bloodstream. 
D. The mosquito carried an organism that caused yellow fever. 


FIGURE 8.6 Representative Items from the Iowa Tests of Educational Development 

Source: Reprinted from Teacher, Administrator, and Counselor Manual: Iowa 
Tests of Educational Development, Forms X-8 and Y-8 (1988). Chicago: Riverside 
Publishing Co. Copyright © 1988. Reproduced with permission of The Riverside 
Publishing Company. 


The Tests of Achievement 
and Proficiency (TAP) 


The Tests of Achievement and Proficiency (TAP) 
are designed to provide a comprehensive appraisal 
of student progress toward traditional academic 
goals in grades 9 through 12. The TAP is the sec- 
ond component in the Riverside Basic Skills As- 
sessment Program; the first component is the Iowa 
Tests of Basic Skills (ITBS), used in grades K 
through 8 (previously discussed). Like the ITBS, 
the TAP is concurrently normed with the CogAT, an 
ability test that measures verbal, quantitative, and 
nonverbal reasoning abilities. The subtests from the 
TAP measure achievement in reading comprehen- 
sion, mathematics, written expression, using 
sources of information, social studies, and science. 
A total or composite score is also provided. 

The TAP also yields an applied proficiency score 
that assesses the examinee’s capacity to handle real- 
life situations. This score reflects student compe- 
tence in applying mathematics and communication 
skills to solving problems of daily living. The items 
emphasize communication of ideas in writing, math- 
ematical solution of problems, use of reference ma- 
terials, and the interpretation of tabular and graphic 
material. The TAP was conormed with the ITED and 
the CogAT and restandardized in 1996. 


Tests of General Educational Development (GED)


Another widely used achievement test battery is the 
Tests of General Educational Development (GED), 
developed by the American Council on Education 
and administered nationwide for high school equiv- 
alency certification (www.acenet.edu). The GED 
consists of multiple-choice examinations in five ed- 
ucational areas: 


Language Arts—Writing 
Language Arts—Reading 


Mathematics 
Science 
Social Studies 


The Language Arts—Writing section also contains 
an essay question that examinees must answer in 
writing. The essay question is scored independently 
by two trained readers according to a 6-point holis- 
tic scoring method. The readers make a judgment 
about the essay based upon its overall effectiveness 
in comparison to the effectiveness of other essays. 

The GED comes in numerous alternate forms. 
Typically, internal consistency reliabilities for the 
subscales are above .90. However, the interrater re- 
liability of scoring on the writing samples is more 
modest, typically between .6 and .7. These findings 
indicate that a liberal criterion for passing this sub- 
test is appropriate so as to reduce decision errors. 
Regarding validity, the GED correlates very 
strongly (r = .77) with the graduation reading test 
used in New York (Whitney, Malizio, & Patience, 
1985). Furthermore, the standards for passing the 
GED are more stringent than those employed by 
most high schools: Currently, individuals who re- 
ceive a passing score for a GED credential outper- 
form at least 40 percent of graduating high school 
seniors (www.acenet.edu). 
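The interrater reliability mentioned above is commonly summarized as a correlation between the two readers' scores. The following is a minimal sketch in Python, assuming Pearson's r is the index of agreement (the text does not say which coefficient was used). The ratings are made-up values on the 6-point scale, included purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2 scores)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical holistic ratings (1-6) from two readers for ten essays.
reader_a = [4, 3, 5, 2, 6, 3, 4, 1, 5, 4]
reader_b = [3, 4, 4, 2, 5, 5, 3, 2, 6, 4]

if __name__ == "__main__":
    print(round(pearson_r(reader_a, reader_b), 2))
```

A coefficient in the .6 to .7 range means the two readers rank the essays similarly but far from identically, which is why a lenient passing criterion reduces the risk of failing an examinee on the strength of one harsh reading.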

The GED emphasizes broad concepts rather 
than specific facts and details. In general, the pur- 
pose of the GED is to allow adults who did not 
graduate from high school to prove that they have 
obtained an equivalent level of knowledge from life 
experiences or independent study. Employers re- 
gard the GED as equivalent (if not superior) to 
earning a high school diploma. Successful perfor- 
mance on the GED enables individuals to apply to 
colleges, seek jobs, and request promotions that re- 
quire a high school diploma as a prerequisite. 
Rogers (1992) and Trevisan (1992) provide unusu- 
ally thorough reviews of the GED. 


Additional Group Standardized Achievement Tests


In addition to the previously listed batteries, a few 
other widely used group standardized achievement 
tests deserve brief mention. Because these tests
strongly resemble the instruments discussed previously,
we provide only the barest listing here. The
Sequential Tests of Educational Progress (STEP-III)
are organized into two batteries, one used for grades 
K through 3, the other used for grades 3 through 12. 
The basic STEP-III battery assesses the following 
educational skills: reading, writing skills, vocabulary, 
mathematics computation, and mathematics con- 
cepts. Additional tests measure attainment in social 
studies, science, study skills, and oral comprehension 
(listening). The STEP-III is a companion test to the 
School and College Ability Tests (SCAT-III).

The widely used Stanford Achievement Series 
is one of the oldest and most prestigious testing 
programs in the United States. The series consists 
of three related test batteries covering grades K 
through 13: the Stanford Early School Achieve- 
ment Test (SESAT) for kindergarteners and first 
graders; the Stanford Achievement Test (SAchT) 
for grades 1 through 9; and the Stanford Test of 
Academic Skills (TASK) for grades 8 through 13 
(grade 13 refers to the first year of college). Re- 
viewers are cautious about the SESAT because the 
value of the test is predicated solely upon content 
validity. Little is known about test-retest reliability, 
criterion-related validity, and construct validity 
(Ackerman, 1992; Carpenter, 1992). The SAchT is 
lauded because of its excellent norm-referenced 
coverage of a representative and balanced national 
consensus curriculum (Brown, 1992; Stoker, 
1992). The TASK has excellent psychometric char- 
acteristics, but in attempting to span high school 
and college achievement, this test undertakes a dif- 
ficult assignment. After all, there is modest agree- 
ment about the curricular intentions of grade school 
and high school, but what are the educational goals 
of the first year in college? 


SPECIAL-PURPOSE ACHIEVEMENT TESTS


Achievement tests can be used for many important 
applied purposes, including the appraisal of knowl- 
edge in advanced fields and the evaluation of pro- 
fessional competency. In this final section we will 
examine two special-purpose achievement tests. 
The College-Level Examination Program (CLEP)
is a widely used program by which students can 
demonstrate college-level achievement and receive 
advance credit or exemption from certain college 
courses. The National Teacher Examination (NTE) 
is a controversial test required by many states for 
teacher certification. 


College-Level Examination Program (CLEP)


CLEP is one of two national testing programs 
through which students can receive college credit by 
examination without enrolling in the courses. The 
other program is the ACT Proficiency Examination 
Program, which we do not discuss here. CLEP is ad- 
ministered by the College Board with financial sup- 
port from the Carnegie Corporation of New York. 
The original purpose of the program was to support 
nontraditional students such as returning veterans 
and older adults who had obtained valuable learn- 
ing experiences outside of the classroom. However, 
it is mainly ambitious high school students enrolled 
in advanced classes who now register to take the 
CLEP examinations. Some students begin college 
with nearly a full semester of course credits ob- 
tained through CLEP and similar programs. 

CLEP examinations cover material taught in 
basic first- and second-year courses, and colleges 
usually grant the same amount of credit as would be 
earned in the corresponding courses. Except for 
English Composition with Essay, each exam is 90 
minutes long and composed primarily of multiple- 
choice questions; a few exams have a fill-in format. 
For the English Composition with Essay test, stu- 
dents also write an essay responding to a specific 
question. Each essay is read by two or more faculty 
consultants, and this grade is combined with the 
multiple choice score and reported as a scaled score. 
Areas tested include American literature, English 
literature, foreign languages (French, German, 
Spanish), American government, United States his- 
tory, principles of macroeconomics, introductory 
psychology, college algebra, biology, chemistry, 
and principles of accounting. 

Scores on the CLEP tests are reported on a 
scale from 20 to 80, with an average of 50 and a 
standard deviation of 10. The reference groups for
these scores consisted of volunteer students com- 
pleting courses in each of the specified areas. These 
students were recruited from a nonrandom but pre- 
sumably representative selection of U.S. colleges 
and universities. The CLEP scores are, in general, 
highly reliable, with split-half coefficients mainly 
in the .90s. The validity of the Subject Examina- 
tions has been evaluated by means of correlating 
CLEP scores with final grades in the relevant 
courses. Most of these correlation coefficients are 
in the .40s and .50s, which supports the concurrent 
validity of these tests. 
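A reporting scale with an average of 50, a standard deviation of 10, and limits of 20 and 80 behaves like a linear transformation of a standard (z) score. The sketch below, in Python, assumes exactly that: a plain linear conversion, rounded and clipped to the scale limits. The actual College Board scaling procedure is not described in the text, and the function name is hypothetical.

```python
def scaled_score(z, mean=50.0, sd=10.0, low=20, high=80):
    """Map a standard (z) score onto a 20-80 reporting scale.

    Assumption: a simple linear transformation, rounded to the nearest
    integer and clipped to the scale limits. This is an illustration,
    not the publisher's documented method.
    """
    raw = mean + sd * z
    return int(min(high, max(low, round(raw))))


if __name__ == "__main__":
    for z in (-3.5, -1.0, 0.0, 1.5, 4.0):
        print(z, scaled_score(z))
```

Under this assumption an examinee one and a half standard deviations above the reference-group mean would receive a 65, and scores more than three standard deviations from the mean would be reported at the floor or ceiling of the scale.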

The CLEP program has received high marks 
from reviewers, but there is a potential negative side 
as well. In particular, some students might “test out” 
of college courses that would have proved enrich- 
ing, inspiring, even life-changing. For example, it is 
possible to have factual knowledge about art, music, 
or drama and therefore pass a CLEP test in one or 
more of these areas. However, in the ambitious 
quest to finish college quickly, students could over- 
look important experiences for personal growth. 


National Teacher Examination (NTE) 


The National Teacher Examination is actually a se- 
ries of tests published by the Educational Testing 
Service and known more formally as the Praxis Se- 
ries. Of the 43 states that require testing as part of 
the licensure process, 35 use the Praxis Series, 
which explains why the test is known informally as 
the National Teacher Examination, or NTE. The 
Praxis Series is nationally administered and contin- 
ually updated and improved. The three categories 
of assessment correspond to major milestones in 
teacher development: 


• Praxis I (Academic Skills Assessments): Entering a teacher training program
• Praxis II (Subject Assessments): Graduating from college and entering the profession
• Praxis III (Classroom Performance Assessments): The first year of teaching


The initial test, Praxis I, is taken early in the student’s
college career to evaluate reading, writing,
and math skills essential for the success of any
teacher. A passing score on this multiple-choice test 
is required before the student can continue his or 
her major in education. These tests can be taken in 
the traditional paper-and-pencil format or as a 
computer-based test that is tailored to each candi- 
date’s ongoing performance. One advantage to the 
computer-based testing is year-round availability, 
whereas the traditional version is given only six 
times a year. Praxis II assesses knowledge of the 
subjects a candidate will teach, as well as how 
much he or she knows about teaching the subject. 
More than 120 content tests (all multiple choice) 
are available. Praxis III is an in-class evaluation by
trained local assessors who use structured criteria 
that have been nationally validated. 

The reliability of the Praxis I and Praxis II tests 
is beyond reproach. Similarly, the content validity 
of these tests is outstanding because they were care- 
fully constructed and refined with the help of many 
experts and test consumers. What is less clear is the 
predictive validity of the Praxis Series insofar as little
information exists to show that good scores or
Praxis evaluations predict good teaching and vice 
versa. Of course, part of the difficulty here is find- 
ing a suitable definition and measure of “good 
teaching.” The National Teacher Examination
probably serves a useful purpose by requiring that 
prospective teachers possess minimum levels of 
knowledge in their disciplines, but the test also 
raises difficult questions with regard to how our 
society identifies promising teachers. Is factual 
knowledge enough? Should we not also insist that 
our teachers possess enthusiasm for their material 
and the capacity to inspire children? These are fea- 
tures not easily captured by objective tests. 


CHEATING: THE DARK SIDE OF ACHIEVEMENT TESTING


The prevailing view in the general public is that 
cheating rarely or never occurs in nationally ad- 
ministered testing programs. We tend to think that 
the risks are too high and the opportunities too 
limited for cheaters to prevail. Therefore, we rest 
assured that test fraud must be a rare event. Unfortunately,
this view is probably naive. After all, a
growing number of people must pass a test to gain 
college entry, get a job, or obtain a promotion. Fur- 
thermore, school officials increasingly are evalu- 
ated on the basis of average test scores in their 
district. Precisely because the stakes are so high, un- 
scrupulous individuals will try to beat the system. 
Consider the case of superior test scores at an 
acclaimed elementary school in Connecticut (As- 
sociated Press, May 4, 1996, and March 15, 1997). 
The stellar reputation of the school was based upon 
high exam scores on the Iowa Tests of Basic Skills 
given to first, third, and fifth graders. The school 
had won blue ribbons from the U.S. Education De- 
partment and was featured as one of the nation’s 
best elementary schools in a prominent magazine. 
However, in a fluke discovery, school district per- 
sonnel noted a high number of erasures on the tests 
from this school and notified the test publisher. On 
close inspection, the publishers found an exceed- 
ingly high number of erasures—9 percent—which 
was three to five times higher than two nearby 
schools. Even more suspicious was the fact that 89 
percent of the erasures were changed from the 
wrong answer to the correct answer. Based upon 
retesting under close supervision, the test publisher 
found “clearly and conclusively” that tampering 
occurred. The principal resigned amid allegations 
that he was responsible for the tampering. 
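The reasoning behind this kind of erasure analysis can be sketched as a one-sided binomial test: given some baseline chance that an ordinary erasure goes from wrong to right, how likely is a proportion as extreme as 89 percent? The Python sketch below is illustrative only. The erasure count (200) and the baseline rate (0.5) are hypothetical placeholders, because the text reports percentages rather than raw counts and does not give the publisher's comparison data.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical figures: 200 erasures, of which 178 (89 percent) changed a
# wrong answer to the right one. The baseline rate of 0.5 is a placeholder;
# a real analysis would estimate it from comparison schools.
n_erasures = 200
wrong_to_right = 178
baseline_rate = 0.5

if __name__ == "__main__":
    p_value = binomial_upper_tail(wrong_to_right, n_erasures, baseline_rate)
    print(f"P(at least {wrong_to_right} of {n_erasures}) = {p_value:.3g}")
```

With these placeholder values the tail probability is vanishingly small, which is the sense in which such a pattern is hard to attribute to chance. The conclusion depends heavily on the baseline chosen, so the figure should not be read as a reconstruction of the publisher's actual analysis.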
Widespread cheating in public school systems 
is sporadically reported in many large cities across 
the United States. In most cases, the cheating is mo- 
tivated by the desire of teachers and principals to 
further their own careers by creating the illusion of 
educational excellence. For example, in 1999, 
dozens of teachers and two principals in the New 
York City public school system were charged with 
helping students cheat on the standardized reading 
and math tests used to rank schools and determine 
whether students move on to the next grade (New 
York Times, December 12, 1999). The cheating 
scheme was described as “one of the largest in the 
recent history of American public schools.” In 2000, 
an entire eighth-grade class in a Chicago elementary
school was required to retake the Iowa Tests of
Basic Skills because a school administrator allegedly
filled in incomplete tests and changed incorrect
answers to correct ones (Chicago Tribune,
June 2, 2000). Officials were tipped off to the fraud 
because the test scores were simply too good to be 
true—the average score for the class was two years 
above their standing. In 2002, Chicago was back in 
the news again, when sophisticated software de- 
tected skills-test cheating at seven schools (Chicago 
Tribune, October 2, 2002). In this case, the school 
chief sought to fire six teachers and an aide, re- 
marking, “We need to stand for something, to teach 
values to our students.” Of course, we only read 
about the cases of cheating that are detected. The 
number of undetected cases is simply unknown, al- 
though probably larger than the public would like to 
believe. 

An especially flagrant instance of cheating on 
national tests was uncovered in Louisiana in 1997.
This case involved wholesale circulation of the Ed- 
ucational Testing Service (ETS) exam administered 
to teachers who want to be school principals. As re- 
ported in the New York Times (September 28, 1997), 
copies of the 145-item test, along with correct an- 
swers, had circulated among teachers throughout 
southern Louisiana, most likely for several years. In 
a state ranked at or near the bottom on nearly every 
educational index, it appears that many potentially 
unqualified persons cheated their way into running 
the schools. ETS handled this case quietly by ask- 
ing more than 200 teachers to retake the test so as 
to “confirm” their initial scores. Unfortunately, the 
Louisiana case was not an isolated instance. The 
New York Times article includes this disquieting 
conclusion: 


In numerous instances across the country, E.T.S. 
has confronted case after case of cheating but with- 
held information from the public and failed to take 
aggressive steps in time to insure the integrity of its 
tests, according to internal documents and inter- 
views with current and former officials there. 


Among the examples cited, ETS allegedly failed to 
monitor its handling of the federal government’s 
test for immigrants who want to become citizens, 
with the likely result that test supervisors accepted 
bribes. English-proficiency tests for foreign students 
also were vulnerable to cheating. In 1994, ETS canceled
the scores of 30,000 students from China
after discovering a ring that was selling the exami- 
nations abroad. In another case, federal prosecutors 
uncovered a nationwide cheating ring involving 
hundreds of students who paid thousands of dollars 
each for answers to the GRE and similar exams. 
This scheme involved a well-known time-zone 
scam in which experienced test takers took the 
exams in New York City and then relayed the an- 
swers to paying customers taking the tests in the 
later time zones. Cizek (1999) catalogues literally 
dozens of ingenious ways that students have devel- 
oped for cheating on tests: writing information on 
the floor, in tissues, on the back of a bottled water 
label; using an ultraviolet pen to write information 
on “blank” paper; and using a video transmitter 
(e.g., hidden in an eyeglass case) to send pictures 
of the test to an outside accomplice who then 
coaches the student by means of an audio receiver 
(e.g., hidden in the ear). 

Dishonest and inappropriate practices by 
school officials are implicated in the recent infla- 
tion of scores on nationally normed group tests of 
achievement. By definition, for a norm-referenced 
test, 50 percent of the examinees should score 
above the 50th percentile, 50 percent below. If the 
same test is used in a large sample of typical and 
representative school systems, average scores for 
the school systems should be split evenly—about 
half above the nationally normed 50th percentile, 
half below. 

According to a recent survey reported in the 
news media (Foster, 1990), virtually all states of the 
union claim that average achievement scores for 
their school systems exceed the 50th percentile. 
The resulting overly optimistic picture of student 
achievement is labeled the Lake Wobegon Effect, 
in reference to humorist Garrison Keillor’s mythi- 
cal Minnesota town where “all the children are 
above average.” 
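A rough calculation shows why this pattern is implausible. The Python sketch below assumes, for illustration only, that each state's average independently has a 50 percent chance of falling above the national median. Real state averages are neither independent nor equally likely to fall on either side, so the result is an order-of-magnitude illustration, not an estimate.

```python
def prob_all_above_median(n_units, p_above=0.5):
    """Probability that every one of n_units independent units lands above
    the median, given each has probability p_above of doing so."""
    return p_above ** n_units


if __name__ == "__main__":
    # Fifty states, each assumed to be a fair coin flip relative to the median.
    print(prob_all_above_median(50))
```

Under these simplifying assumptions the chance that all 50 state averages exceed the 50th percentile is less than one in a quadrillion, so something other than chance (outdated norms, selective reporting, or the practices described in this section) must be at work.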

How does inflation of achievement test scores 
arise? According to Cannell (1988), the major 
cause is educational administrators who are des- 
perate to demonstrate the excellence of their school 
systems. Precisely because our society attaches so
much importance to achievement test results, some
educators apparently help students cheat on stan- 
dardized tests. The alleged cheating includes the 
following: 


• Teachers and principals coach students on test answers.
• Examiners give more than the allotted time to take tests.
• Administrators alter answer sheets.
• Teachers teach directly to the specific test items.
• Teachers make copies of the tests to give to their students.


Cannell notes that over 300 teachers and school 
administrators answered his trade journal adver- 
tisement, admitting that they or colleagues had 
tampered with tests or helped students improperly. 
These improprieties constitute a quiet crisis that 
continues unabated. Another consequence of the 
Lake Wobegon Effect is that test publishers and 
federal reviewers will likely increase their efforts 
to monitor test security. In sum, the importance that 
our society attaches to achievement test scores has 
caused a number of unappealing side effects that 
undermine the very foundations of nationally 
normed group-testing programs. 

Moore (1994) reports on a special case in edu- 
cational testing, namely, the districtwide conse- 
quences of court-ordered achievement testing. He 
surveyed 79 teachers from third- through fifth- 
grade level in a midwestern town in which the court 
required the use of a standardized test to determine 
the effectiveness of a desegregation effort. The test 
in question, the Iowa Tests of Basic Skills (ITBS) 
is a well-respected group achievement test that re- 
quires strict adherence to instructions and time lim- 
its for obtaining valid results (discussed earlier). Yet 
the teachers found little value in the testing pro- 
gram, complaining that its benefits did not offset 
the time and costs involved. As a consequence of 
their devaluing the effort, nonstandard testing was 
practically the rule rather than the exception. The 
teachers engaged in several nonstandard practices, 
most of which tended to inflate the test scores. Inappropriate
testing practices included praising students
who answered a question correctly during the
test (67 percent), using last year’s test questions for
practice (44 percent), recoding a student’s answer 
sheet because he or she just “miscoded” the answer 
(26 percent), giving students as much time as they 
needed (24 percent), giving students items that 
were directly off the test (24 percent), and giving 
hints or clues during the test (23 percent). In gen- 
eral, Moore (1994) notes that teachers modified 
their instructional efforts and curriculum in antici- 
pation of having their students take the test. More 
than 90 percent of the teachers added test-related 
lessons to the curriculum, and more than 70 percent 
eliminated topics so that they could spend more 
time on test-related skills. Whether these are desir- 
able changes is surely open to debate. Moore 
(1994) concludes: 


Standardized testing has held a central role in edu- 
cation for many years. What studies of testing pro- 
gram impact have most recently demonstrated is 
the growing reliance on test scores for decision 
making and the increasing potential for misuse of 
test scores. Educational and political policymakers 
need to address the important link between instruc- 
tion and testing and ensure that teachers are inte- 
grated into, not isolated from, the intent of testing. 
(p. 365) 
In sum, what this study demonstrates is that mandated
educational testing can have the unanticipated
consequence of polluting the validity of a
worthy test—especially when crucial stakeholders 
have no voice in the process. 

We cannot survey here all the unintended side 
effects of educational achievement tests, because 
the possibilities are nearly endless. For example, 
what about the warping effects of achievement-testing
programs upon school curricula? As we have
seen in Moore’s (1994) study, teachers do modify 
their classroom practices with the intention of help- 
ing students score well on the tests. However, in 
teaching to the tests, educators may emphasize bits 
and pieces of factual knowledge rather than impart- 
ing a general ability to think clearly and solve prob- 
lems. In conclusion, it appears that an excessive 
emphasis upon nationally normed achievement tests 
for selection and evaluation promotes inappropriate 
behavior, including outright fraud and cheating on 
the part of students and school officials. Just how 
widespread is the problem? Although we live with 
the optimistic assumption that fraud in nationally 
normed testing programs is rare, the disturbing truth 
is that we really don’t know how often this occurs. 


SUMMARY 


1. Achievement tests are used to measure the 
attainment of skills taught within schools or train- 
ing programs. Group achievement tests are used 
mainly in the classroom, whereas individual 
achievement tests are employed one-on-one in clin- 
ical or educational settings. 


2. The distinction between ability, aptitude, 
and achievement tests is fuzzy. In fact, the correla- 
tions among these three kinds of measures are often 
very high. However, the typical applications of 
these tests differ: aptitude tests are used to predict 
future performance, whereas achievement tests 
gauge current functioning. 


3. The functions of group achievement testing
include screening for possible learning disability,
identification of individual strengths and
weaknesses, grouping of students, and appraising
the success of educational programs.


4. Test publishers have developed several ex- 
cellent multilevel test batteries to meet the needs of 
school systems for assessment of educational 
achievement. In a multilevel battery, test booklets 
contain overlapping sections, and students at dif- 
ferent grade levels enter and exit the test materials 
at grade-appropriate places. 


5. The Iowa Tests of Basic Skills is a multi- 
level battery of achievement tests that covers grades 
K through 8. Concurrently normed with the 
Cognitive Abilities Test (CogAT), the ITBS 
measures achievement in such basic areas as 
vocabulary, reading comprehension, spelling, and 
mathematics. 
6. The Metropolitan Achievement Test
(MAT) is a multilevel battery for grades K through 
12. The MAT was one of the first achievement tests 
to report reading scores with the Lexile approach 
that allows for matching readers and text. 


7. Unlike many other achievement tests that 
emphasize skills linked to specific curricular goals, 
the intention of the Iowa Tests of Educational De- 
velopment (ITED) is to measure the fundamental 
goals or generalized skills of education that are in- 
dependent of the curriculum. For this reason, the 
items used in the ITED stress higher-order thinking 
skills. 


8. The Tests of Achievement and Proficiency 
(TAP) are designed to provide a comprehensive ap- 
praisal of student progress toward traditional aca- 
demic goals in grades 9 through 12. Companion 
test to the ITBS, the TAP is concurrently normed 
with the CogAT, an ability test that measures ver- 
bal, quantitative, and nonverbal reasoning abilities. 


9. Another widely used achievement test bat- 
tery is the Tests of General Educational Develop- 
ment (GED), developed by the American Council 
on Education and administered nationwide for high 
school equivalency certification. The GED consists
of multiple-choice examinations in five educational
areas and includes an essay question to assess writ- 
ing skills. 

10. The College-Level Examination Program 
(CLEP) allows students to receive college credit by 
examination without enrolling in the courses. There 
are five general examinations (e.g., Humanities, 
College Mathematics) and 30 specific examina- 
tions (e.g., American literature, introduction to psy- 
chology, algebra, biology). 


11. Published by the Educational Testing Ser- 
vice, the National Teacher Examination (NTE) is 
known more formally as the Praxis Series. This 
evaluation consists of three categories (academic 
skills, subject expertise, and classroom perfor- 
mance). The subject assessments are required by 
many states as a prerequisite for licensure. 


12. The prevalence of cheating on nationally 
administered achievement tests is unknown. How- 
ever, many reports have surfaced in recent years, in- 
cluding the alteration of answer sheets by school 
officials, wholesale circulation of some licensing 
examinations, and inappropriate testing practices by 
individual teachers (e.g., giving extra time to finish 
tests). 
Key Terms and Concepts


Neuropsychological assessment is the application
of specialized tests for purposes of diagnosing
and treating individuals with known or
suspected brain dysfunction. Neuropsychological 
tests are distinctive because of their demonstrated 
link to brain functions (Lezak, 1995). Specifically, 
any test or technique that is highly sensitive to the
effects of brain impairment, and especially a
test or technique that permits inferences about the
site, type, or degree of such impairment, would
qualify as a neuropsychological procedure. The
purpose of this chapter is to introduce the reader to 
neuropsychological tests, concepts, and methods.
In addition, the chapter includes a secondary em- 
phasis upon assessment of age-related disorders 
such as Alzheimer’s disease, multi-infarct demen- 
tia (stroke), and late-life depression, because these 
problems are best understood within a neuropsy- 
chological context. In Topic 9A, A Primer of Neu- 
ropsychology, we provide a condensed review of 
human brain functions. We also give special con- 
sideration to disorders of aging and review medical 
approaches to brain imaging. The central purpose 
of this topic is to prepare the reader for the ensuing 
discussion of neuropsychological and geriatric as- 
sessment procedures. In Topic 9B, Neuropsycho- 
logical and Geriatric Assessment, we examine the 
nature and use of major neuropsychological tests 
and geriatric assessment procedures. 
Neuropsychology is the study of the relation- 
ship between brain function and behavior (Cy- 
towic, 1996; Kolb & Whishaw, 2002). Human 
neuropsychology did not flourish until World War 
II, when psychologists and neurologists encoun- 
tered thousands of brain-injured soldiers in need of 
assessment and rehabilitation (Aita, Armitage, Re- 
itan, & Rabinowitz, 1947; Goldstein, 1944). It was 
a grim circumstance for innovation, but the carnage 
of war obliged psychologists to develop new tools 
for understanding the effects of brain damage. 
Thus, as humankind ravaged its own during the 
Second World War, several specialists forged a new 
specialty, clinical neuropsychology, to help deal 
with the aftermath. The first and most fundamental 
goal of this fledgling discipline was to determine 
the relationships between brain injury and perfor- 
mance upon psychological tests (Case Exhibit 9.1). 
The central focus of human neuropsychology 
is the advancement of a science of human behav- 
ior founded on human brain function (Kolb & 
Whishaw, 2002). Students of neuropsychological 
assessment must be familiar with the essentials of 
brain function if they are to appreciate the role of 
tests in the diagnosis of brain-impairing conditions. 
In this topic we present a primer of neurological 
terms and concepts, including a review of brain 
anatomy and function, with discussion of medical 
approaches to imaging the brain. We also touch
briefly upon the major forms of neuropathology
that might disrupt the normal functioning of the 
brain, with special emphasis upon the elderly. 


ANATOMY OF THE BRAIN


By convention the nervous system is divided into 
the central nervous system consisting of the brain 
and spinal cord and the peripheral nervous system, 
which includes the cranial nerves and the network 
of nerves emanating from the spinal cord. Neu- 
ropsychological assessment aims to determine the 
relationship between brain and behavior, so the 
brain will be the major focus of our discussion here. 

The brain weighs roughly three pounds. It is 
composed principally of two components, neurons 
and glial cells. For decades it has been believed that 
neurons do not reproduce in adult mammals. How- 
ever, recent research suggests that adult humans 
may have a limited capacity to reproduce neurons, 
especially in areas of the brain important for learn- 
ing and memory (Rakic, 2002). Capable of pro- 
lific reproduction, the more numerous glial cells 
provide various forms of structural support to the 
neurons. 

The 10¹¹, or 100 billion, neurons in the brain
are arranged in complex networks that largely have 
defied understanding. In part, the inscrutability of 
the brain derives from its computational complex- 
ity. Neurons communicate by sending all-or-none 
electrochemical impulses to one another. Each neu- 
ron might send transmissions to thousands, perhaps 
tens of thousands, of other neurons at near and 
distant sites called synapses. Chemical communi- 
cations across the synapses can occur up to a thou- 
sand times a second. Even if we use a conservative 
estimate of 1,000 synapses per neuron, in theory 
the number of neural transmissions that could 
occur in just one second is a staggering 10^17, or
100,000,000,000,000,000 (one hundred quad- 
rillion). No wonder that staid neuroscientists such 
as Sir John Eccles (who received a Nobel Prize for 
his work in neurophysiology) resort to hyperbole 
and describe the brain as “without qualification the 
most highly organized and most complexly orga- 
nized matter in the universe” (Eccles, 1973). 
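
The arithmetic behind this estimate can be checked with a short sketch. The three figures are the rough values quoted in the text; the function and variable names are illustrative assumptions, not part of any standard library, and the result is a theoretical ceiling rather than a measured rate.

```python
# Rough figures quoted in the text; all three are estimates.
NEURONS = 10**11              # about 100 billion neurons
SYNAPSES_PER_NEURON = 1_000   # conservative; the text notes it may reach tens of thousands
FIRINGS_PER_SECOND = 1_000    # upper limit: "up to a thousand times a second"


def max_transmissions_per_second(neurons, synapses_per_neuron, firings_per_second):
    """Return the theoretical ceiling on synaptic transmissions in one second."""
    return neurons * synapses_per_neuron * firings_per_second


total = max_transmissions_per_second(NEURONS, SYNAPSES_PER_NEURON, FIRINGS_PER_SECOND)
print(f"{total:,}")  # prints 100,000,000,000,000,000
```

Raising the synapse count to 10,000 per neuron, the upper range mentioned above, would increase the ceiling tenfold, to 10^18.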

In spite of the difficulties in comprehending
the brain, neuroscientists have formulated a few 
rudimentary and partial theories from the following 
sources: 


1. Behavioral studies of patients with discrete 
brain lesions 

2. Clinical observations of the effects of electrical 
stimulation of selected brain sites in conscious 
epileptic patients about to undergo therapeutic 
brain excisions 

3. Laboratory studies of epileptic patients in whom 
the cerebral hemispheres have been surgically 
disconnected for purposes of seizure control 

4. Research studies with new brain-imaging tech- 
niques that provide a real-time analysis of on- 
going brain activity. 


In the remainder of this topic, we provide the reader 
with a primer of the brain-behavior relationships 
discerned from these sources. 


Functional Organization of the Brain 


The functional organization of the human brain is 
difficult to comprehend because important structures 
are interwoven and folded over upon one another. In
addition, the brain contains a complicated system of 
fluid-filled caverns called ventricles. Although they 
begin in the center of the brain, the ventricles consist 
of a canal system that ultimately extends outward 
and wraps around the entire brain and also down the 
spinal cord to act as a buffer for the extremely deli- 
cate structures of the central nervous system. 

Even though the spatial arrangement of brain 
structures is very complex (Figure 9.1), from the per- 
spective of neural interconnections, the divisions of 
the brain are linear successions of one another: hind- 
brain (myelencephalon and metencephalon), mid- 
brain (mesencephalon), and forebrain (diencephalon 
and telencephalon). Structurally, the lowest brain 
centers (in the hindbrain) are the most simply orga- 
nized, whereas the more forward brain centers (in 
the forebrain) are large, elaborate, and anatomically 
complex. Functional organization follows this pat- 
tern also: The lower brain centers mediate primitive, 
simple life functions such as breathing, whereas the 
forward brain centers govern complex, higher func- 
tions such as thought and perception. 

In the fully developed brain, each division consists
of several important structures (Table 9.1). In
the remainder of this unit, we review function and
dysfunction in these major structural components
of the human brain.

FIGURE 9.1 A Midline View of the Right Cerebral Hemisphere Showing Important Structures in the Hindbrain, Midbrain, and Forebrain

TABLE 9.1 The Divisions of the Human Brain

Forebrain
  Telencephalon (endbrain)        Cerebral cortex
                                  Corpus callosum
                                  Basal ganglia
                                    Putamen
                                    Globus pallidus
                                    Caudate nucleus
                                    Amygdala
                                  Limbic lobe
                                    Hippocampus
                                    Septum
                                    Cingulate gyrus
                                  Olfactory bulbs
  Diencephalon (between-brain)    Thalamus
                                  Hypothalamus
                                  Pineal body
Midbrain
  Mesencephalon (midbrain)        Tectum
                                    Superior colliculi
                                    Inferior colliculi
                                  Tegmentum
                                  Cranial nerves
Hindbrain
  Metencephalon (across-brain)    Cerebellum
                                  Pons
                                  Reticular formation*
  Myelencephalon (spinal brain)   Medulla oblongata

*The reticular formation begins in the hindbrain and extends upward into the diencephalon.


HINDBRAIN


The lowest part of the brain is the hindbrain, and its 
lowest section is the medulla oblongata. All of the 
nerve fibers from higher parts of the brain pass 
through the medulla oblongata on their way to the
spinal cord. The crisscrossing of these fibers in the
medulla explains why the two halves of the brain
wind up controlling the opposite side of the body.
The medulla mediates a number of essential bodily 
functions: breathing, swallowing, vomiting, blood 
pressure, and, partially, heart rate (Kandel, Schwartz, 
& Jessell, 1995). Aspects of talking and singing also 
are governed here, although higher brain sites are in- 
timately involved in these functions as well. 

Significant damage to the medulla usually is 
fatal. In rare cases, a small stroke in the medulla 
causes one or more of the following symptoms: op- 
posite-sided paralysis, partial loss of pain and tem- 
perature sense, clumsiness, dizziness, same-sided 
paralysis and atrophy of the tongue, and partial loss 
of the gag reflex. The polio virus—rampant in the 
1950s but now well controlled—may attack the 
medulla, shutting down the neural control of breath- 
ing and necessitating a mechanical respirator. 

The reticular formation, a network of ascend- 
ing and descending nerve cell bodies and fibers, be- 
gins in the spinal cord and extends through the 
medulla all the way up to the thalamus. Specific nu- 
clei within the reticular formation project to wide 
areas of the brain and thereby help mediate com- 
plex postural reflexes and muscle tone. Based on 
the classic studies of Moruzzi and Magoun (1949) 
demonstrating that ascending nerve tracts within 
the reticular formation govern general arousal or 
consciousness, portions of this structure are also 
known as the reticular activating system. Damage 
to the reticular activating system gives rise to global 
diminution of consciousness ranging from chronic 
drowsiness to stupor or coma (Carpenter, 1991). 

The pons and cerebellum are the highest struc- 
tures in the hindbrain. Together they help coordi- 
nate muscle tone, posture, and hand and eye 
movements. Lesions of the pons may render the in- 
dividual incapable of making coordinated lateral 
eye movements. For this reason, neurologists com- 
monly ask patients to demonstrate left-right and up- 
down eye movements. 

Although many brain sites are involved in 
motor control, the cerebellum plays an important 
role. The cerebellum receives sensory information 
from every part of the body and coordinates the
details of automatic, skilled movements. Damage to
the cerebellum may cause a variety of motor dis- 
turbances, depending upon the specific sites af- 
fected (Bradshaw & Mattingley, 1995). Slurred, 
hesitant speech known as dysarthria may be a 
symptom of cerebellar damage. Muscles may be- 
come flabby and tire easily. Rapid, coordinated tap- 
ping of the index finger may prove difficult. 
Measures of finger-tapping speed (Reitan & Wolf- 
son, 1993) are therefore an important component 
of neuropsychological test batteries. 

Bodily movements may lose their coordination 
in cerebellar disease, becoming spasmodic and 
jerky. Even a simple gesture such as reaching for a 
cup may result in the inadvertent thrusting of cup 
and contents halfway across the room. The char- 
acteristic wide-based gait found in many chronic 
alcoholics—called ataxia—is a consequence of 
cerebellar degeneration (Ghez, 1991). Another 
symptom of cerebellar damage is intention tremor, 
so named because it is not present at rest but arises 
during voluntary, intentional movements of the 
hands. Nystagmus also is common in cerebellar 
disease. In this symptom, the eyes appear to jitter 
back and forth even when the individual attempts 
to hold a steady gaze. 

In conjunction with the vestibular center in the 
inner ear, the cerebellum also helps coordinate the 
vestibuloocular reflex (VOR). The VOR acts to 
maintain the eyes on a fixed target when the head 
is rotated. Without the VOR, vision would be in- 
credibly blurred whenever the head moved even a 
fraction of an inch. Instead, a small area of the 
cerebellum coordinates a rapid refixation of the 
eyes to compensate for head movements. 


MIDBRAIN


The midbrain consists of two main divisions that 
wrap around a fluid-filled aqueduct. The tectum or 
roof lies above the aqueduct, and the tegmentum or 
floor lies below the aqueduct. The tectum consists 
mainly of two sets of bilaterally symmetrical nu- 
clei, the superior colliculi and the inferior colliculi. 
The superior colliculi mediate head and eye movements
used to localize and follow visual stimuli.
The inferior colliculi provide the same function for 
auditory stimuli. 

The tegmentum is a relay station for sensory 
and motor fibers and also contains nuclei for many 
of the cranial nerves (some of which also emanate 
from the hindbrain). The 12 paired cranial nerves 
are major neural tracts whose functions are well un- 
derstood and easily tested. Some are exclusively 
sensory, relaying information from the external 
world to the brain; some are exclusively motor, 
serving to execute commands from the brain; about 
a third of the cranial nerves possess both sensory 
and motor functions. Neurologists refer to the cra- 
nial nerves by number. The numbers correspond 
roughly to the top to bottom sequence of the nerves’ 
emergence from the brain (Table 9.2). The reader 
will notice that many cranial nerves mediate aspects 
of vision and eye movement, basic sensory func- 
tions, and movement of jaw, tongue, face, and head. 
Over the centuries, neurologists have devised a va- 
riety of simple confrontational techniques to assess 
the cranial nerves. As peculiar as it may appear, ask- 
ing the patient to stick out his or her tongue and 
move it left, right, up, or down can provide im- 
portant information about the functioning of the 
hypoglossal (twelfth) cranial nerve. In like manner,
various simple tests of hearing, balance, eye movement,
and so on are used to complete the examination
of the cranial nerves.

TABLE 9.2 The Cranial Nerves and Their Functions

 1. Olfactory              Sense of smell
 2. Optic                  Vision
 3. Oculomotor             Horizontal and vertical eye movement
 4. Trochlear              Vertical eye movement
 5. Trigeminal             Facial sensation, jaw movement
 6. Abducens               Horizontal eye movement
 7. Facial                 Facial movement and taste
 8. Auditory/vestibular    Hearing and balance
 9. Glossopharyngeal       Taste, swallowing
10. Vagus                  Visceral reflexes
11. Accessory              Head movement
12. Hypoglossal            Tongue movement

Although neuropsychologists have some tools 
and procedures suitable for the assessment of dys- 
function in the hindbrain and midbrain, disabilities 
associated with these brain sites are more typically 
diagnosed and treated by medical specialists, par- 
ticularly neurologists. In the main, neuropsycho- 
logical tests and procedures were conceived for the 
assessment of function and dysfunction in the fore- 
brain. Correspondingly, we devote the major share 
of our discussion to the forebrain structures. The 
forebrain consists of the diencephalon, a smaller- 
than-fist-sized set of structures roughly at the cen- 
ter of the brain, and the telencephalon, a more 
massive group of brain structures sitting astride the 
diencephalon (see Table 9.1). 


DIENCEPHALON


The structures of the diencephalon include the thal- 
amus, hypothalamus, and pineal body. The pineal 
body is a pea-sized structure that sits at the center 
of the brain. It is known that the pineal body se- 
cretes the hormone melatonin in a cyclic biological 
rhythm, but the exact function of this gland is un- 
clear (Crapo, 1985). Owing to a lack of knowledge 
about its function, contemporary textbooks of neu- 
roscience virtually ignore this structure. From a 
neuropsychological standpoint, the thalamus and 
hypothalamus are much more important. 

The thalamus is a small bifurcated structure 
at the base of the brain. The importance of the 
thalamus cannot be exaggerated: It is the key struc- 
ture that provides sensory input and information 
about ongoing movement to the cerebral cortex. 
In fact, all sensory information except for olfaction 
(the sense of smell) is sent to the thalamus first 
and then projected to specific regions of the cere- 
bral cortex by neural tracts known collectively as 
the thalamocortical radiations. An extensive lesion 
on one side of the thalamus causes gross impair- 
ment of all forms of sensibility on the opposite side 
of the body. Smaller lesions may cause the threshold
for pain to be raised; in addition, painful stimuli
may cause an exaggerated response. Appreciation
of posture and passive movement may
dramatically diminish with thalamic lesions (Car- 
penter, 1991). 

The thalamus is much more than just a relay 
station; it also plays a key role in the integration 
of neural systems. The thalamus contributes to 
memory functions, attention, speech, and emo- 
tional experience (Kandel, Schwartz, & Jessell, 
1995). Carpenter (1991) asserts that the thalamus is 
the key to understanding the operation of the cere- 
bral cortex. 

The hypothalamus is a deceptively small struc- 
ture that sits just below and in front of the thalamus. 
Even though it composes only about 0.3 percent of 
the brain’s weight, the hypothalamus is involved in 
numerous aspects of motivated behavior and bodily 
regulation: feeding, sexual behavior, sleeping, tem- 
perature regulation, emotional behavior, and move- 
ment. Well studied in lower animals, the functions 
of the hypothalamus are less well known in humans 
(Kolb & Whishaw, 2002). It is known that the hy- 
pothalamus exerts proprietary control over the pitu- 
itary gland, thereby modulating a wide range of 
endocrine functions. The most common cause of a 
hypothalamic lesion is a severe head injury. Hypo- 
thalamic lesions often lead to disturbances of 
pituitary function, including excessive or deficient 
intake of food or water and temperature and blood 
pressure dysregulation (Kupfermann, 1991a). Dysfunction
of the hypothalamus also can lead to emotional
dysregulation (especially fear or rage) and
sleep disturbance (hypersomnolence or insomnia). 

The outermost formation of the forebrain is the 
telencephalon, a structure that is vastly larger than 
the diencephalon. Neuropsychological tests are 
particularly well suited to detecting dysfunction 
within the telencephalon. We present separate dis- 
cussions of the following telencephalic structures: 
limbic lobe, basal ganglia, corpus callosum, and 
cerebral cortex. Please note that some degree of 
oversimplification is unavoidable when discussing 
brain functions. The reader is reminded that the 
brain is richly interconnected with complex and 
poorly understood functional systems. Individual 
brain sites rarely act in isolation. 

LIMBIC LOBE


The limbic lobe consists of the hippocampus, sep- 
tum, and cingulate gyrus. These structures line the 
inside wall of the cerebral cortex. The limbic lobe is 
commonly referred to as the limbic system, which 
is probably a misnomer based on early speculations 
now known to be incomplete. In particular, Papez 
(1937) proposed that the structures of the limbic 
lobe constitute a continuous neural circuit responsi- 
ble for the elaboration of emotion and the control of 
visceral activity—hence the reference to the limbic 
system. This view has proved to be oversimplified, 
but the catchy reference to a functional system is 
still with us. It now seems clear that the separate 
structures of the limbic lobe may have many differ- 
ent functions, including a major role in visceral 
function, emotional behavior, and memory. 

One limbic structure, the hippocampus, is 
known to play a special role in memory. Humans 
possess two hippocampi, one in each temporal 
lobe. The hippocampi are part of a complex, ill- 
defined memory circuit that consolidates new 
experiences into long-term memories (Afifi & 
Bergman, 1998). Damage to one hippocampus may 
cause mild deficits in memory. However, persons 
in whom both hippocampi have been damaged or 
surgically removed experience a catastrophic in- 
ability to remember anything new for more than a 
few seconds (Milner, Corkin, & Teuber, 1968). 
These unfortunate individuals retain long-term 
memories acquired before the bilateral hippocam- 
pal destruction, but possess no capacity to convert 
new short-term memories into long-term memo- 
ries. Such persons are prisoners of the moment. 
They might read the same magazine time after time 
and greet the doctor each day as if he or she were a 
total stranger. General intelligence is little affected. 
However, because of their memory deficits, these 
patients invariably require institutionalization or a 
high degree of supervision. 

Unilateral destruction of the hippocampus 
causes selective memory impairments that corre- 
late with well-known hemispheric specializations 
(verbal functions in the left hemisphere, spatial 
functions in the right hemisphere, discussed later). 


Loss of the left hippocampus impairs verbal mem- 
ory, whereas destruction of the right hippocampus 
causes impairment in pictorial and auditory mem- 
ory, particularly with visual and auditory patterns 
that are not easily labeled (Smith & Milner, 1981). 
In contrast to the memory disturbances caused by 
bilateral hippocampal damage, which are profound 
and long-lasting, the memory disturbances caused 
by unilateral damage tend to be transient and sub- 
tle. Sophisticated neuropsychological testing is re- 
quired for their detection (Rausch, 1985). 


BASAL GANGLIA


The basal ganglia consist of a collection of nuclei 
in the forebrain that make connections with the 
cerebral cortex above and the thalamus below. The 
basal ganglia are traditionally considered as part of 
the motor system. The main constituents of the 
basal ganglia are three large subcortical nuclei: 
the caudate, the putamen, and the globus pallidus. 
Some authorities also consider the amygdala to be 
part of the basal ganglia (Carpenter, 1991). These 
structures are interconnected with and functionally 
related to the subthalamic nucleus and the substan- 
tia nigra. Along with the cerebellum, the corti- 
cospinal system, and motor nuclei in the brain 
stem, the basal ganglia participate in the control of 
movement. Unlike the other components of the 
motor system, the basal ganglia do not have direct 
connections with the spinal cord. The motor func- 
tions of the basal ganglia are indirect, mediated via 
neural connections with the frontal cerebral cortex. 

The most common syndrome caused by damage 
to the basal ganglia is Parkinson’s disease (Cote & 
Crutcher, 1991). In Parkinson’s disease, three char- 
acteristic types of motor disturbances are observed: 
involuntary movement, including tremor; poverty 
and slowness of movement without paralysis; and 
changes in posture and muscle tone. In its later 
stages, this disease is typified by an immobile, 
masklike facial expression, an extreme difficulty 
initiating movements, and a fine tremor that may 
disappear once a movement is under way. 

Patients with Parkinson’s disease also reveal 
specific cognitive deficits, suggesting that the basal
ganglia contribute not just to movement but to thinking
as well. Deficits observed in these patients include
diminished cognitive flexibility and mildly
reduced learning and recall (Koss, 1994). A loss of 
spontaneity and a lack of initiative also are ob- 
served (La Rue, 1992). 


CORPUS CALLOSUM


The corpus callosum is the major commissure that 
serves to integrate the functions of the two cerebral 
hemispheres. This large bundle of subcortical nerve 
fibers is about four inches long and a quarter inch 
thick. The corpus callosum spans the brain from 
side to side just above the level of the thalamus. Al- 
though there are exceptions, the corpus callosum 
generally connects homologous brain sites in the 
left and right hemispheres. 

The function of the corpus callosum was poorly 
understood until the 1960s when Sperry, Gazzaniga, 
and others initiated sophisticated laboratory studies 
of so-called split-brain patients (Sperry, 1964; Gaz- 
zaniga, 1970; Gazzaniga & LeDoux, 1978). These 
patients were persons with epilepsy whose corpora
callosa had been severed to prevent the transport of
epileptic discharges from one hemisphere to the 
other. Although outwardly normal, split-brain pa- 
tients revealed a striking isolation of consciousness 
when visual information was restricted to one 
hemisphere or the other. For example, when a pic- 
ture of an apple was tachistoscopically presented to 
the left side of the examinee’s fixation point, this 
stimulus was processed only in the right hemi- 
sphere (on account of the normal crossing over of 
neural connections). Furthermore, because the cor- 
pus callosum was severed, the image of the apple 
remained trapped in the right hemisphere. As the 
reader will discover later, the right hemisphere is 
usually mute and does not subserve important lan- 
guage functions. Thus, when asked, “What did you 
see?” the examinees, responding from the verbal 
left hemisphere, would honestly reply, “Nothing.” 
Yet, these patients could readily identify the ob- 
ject by pointing to it with the left hand (which 
is under the neural control of the right hemisphere). 
This suggests that although the right hemisphere 
cannot talk, it has a separate and independent capacity
to perceive, learn, remember, and issue commands
for motor tasks.

In a normal individual with intact corpus callo- 
sum, consciousness appears unitary because the 
two halves of the brain can communicate and forge 
a compromise as regards perception, thought, and 
action. Much of our knowledge of hemispheric 
specializations, discussed later, has been garnered 
from the detailed study of split-brain patients. Fur- 
ther insight has been gained from studies of persons 
afflicted with congenital absence of this structure, 
a condition known as agenesis of the corpus callo- 
sum. These patients usually have a variety of intel- 
lectual defects, indicating that the corpus callosum 
facilitates learning in many different cognitive 
spheres (Bradshaw & Mattingley, 1995). 




CEREBRAL CORTEX


The cerebral cortex, the outermost layer of the 
brain, is the source of the highest levels of sensory, 
motor, and cognitive processing. Also called the 
neocortex, the cerebral cortex is a very recent evo- 
lutionary development. It is the functional capacity 
of this brain system—a uniform six layers deep— 
that most dramatically separates humans from the 
lower animals. 

The tissue of the cerebral cortex is folded over 
into elaborate convolutions consisting of bulges and 
grooves. The prominent bulges are called gyri (singular
gyrus), whereas the clefts, fissures, and grooves
are called sulci (singular sulcus). This arrangement 
allows the brain to have a great deal more cerebral 
cortex than if the surface were smooth. Although the 
pattern of gyri and sulci is subtly unique for each 
person, certain major landmarks such as the central 
sulcus and the lateral sulcus (Figure 9.2) are always 
discernible in a normal brain. 

A small portion of the cerebral cortex is com- 
mitted cortex. These sites are dedicated to basic 
sensory processing of vision, hearing, touch, and 
motor control. Nonetheless, the specificity of com- 
mitted cortex is relative, not absolute. For example, 
the precentral gyrus is regarded as the motor strip, 
yet only 40 percent of the primary motor cells 
subserving voluntary movement are located there;
another 10 to 20 percent are located in the sensory
strip, and 40 to 50 percent are situated in adjoining
brain sites (Brodal, 1981). Furthermore, the motor
strip contains a sizeable proportion of sensory cells,
too. Thus, cells that subserve each specific sensory
or motor function are highly concentrated in the respective
committed area, but also thin out and overlap
with nearby brain sites.

FIGURE 9.2 Major Landmarks of the Left Cerebral Hemisphere
(F = Frontal lobe, P = Parietal lobe, T = Temporal lobe, O = Occipital lobe, C = Cerebellum)

The majority of the cerebral cortex is uncom- 
mitted or association cortex. These sites are in- 
volved in the analysis of sensory information and 
the formulation of motor responses. It is the rela- 
tively large proportion of association cortex that 
distinguishes Homo sapiens from lower animals.
Abstract thought, creativity, and problem solving 
are, in large measure, underwritten by these evolu- 
tionary adaptations of the human brain. 

Benson (1994) subdivides association cortex 
into three types: unimodal, heteromodal, and 
supramodal. The unimodal association areas are 
dedicated to a single modality (vision, hearing, 
somesthesis, or movement) and located adjacent to 
the committed cortical areas. Within each unimodal 
association area, incoming stimuli are categorized 
and compared with previous information. Damage 
to the unimodal areas tends to cause difficulties in 
discriminating and categorizing incoming stimuli 
within that modality. For example, a lesion in the
unimodal visual area will leave vision intact but impair
visual recognition and discrimination. A variety
of tests involving overlapping figures have been
designed to assess impairment of this type. For ex- 
ample, a person with damage to the unimodal vi- 
sual association area would be unable to recognize 
the individual shapes in Figure 9.3. The hetero- 
modal association areas possess neural connections 
that travel across modalities. In these areas, cross- 
modal information involving sensory modalities 
(vision, hearing, touch) is intermixed and pro- 
cessed. Damage to heteromodal association cortex 
may cause a variety of well-known syndromes such 
as aphasia, discussed later. Finally, the supramodal 
association cortex, located exclusively in the 
frontal area, is responsible for the control of high- 
level cognition through a process of selection and, 
especially, inhibition. This cortex is phylogeneti- 
cally the most recent and exerts executive control 
over other brain functions. Damage to this area of 
the brain may cause impairments of impulse con- 
trol, planning, and other executive functions. 

In the remainder of this section, we discuss two 
aspects of brain function that occur within the cere- 
bral cortex: (1) localization of function within the 
four major lobes of each cerebral hemisphere, and 
(2) lateralization of function within the left and 
right cerebral hemispheres, with special attention 
to the language functions of the left hemisphere. 





FIGURE 9.3 Examples of Overlapping Figures for 
Testing Visual-Perceptual Dysfunction 

Source: From Gregory, Robert J. Adult intellectual assessment, 
p. 246. © 1987. Published by Allyn and Bacon, Boston, MA. 
Copyright © 1999 by Pearson Education. Adapted by permission 
of the publisher. 


We will also introduce the reader to brain-imaging 
techniques and other clinical tests for brain dys- 
function, because their use often goes hand in hand 
with a neuropsychological evaluation. This discus- 
sion is designed to help the reader understand the 
nature and purpose of the neuropsychological tests 
discussed in the next topic. 


FUNCTIONS OF THE CEREBRAL LOBES


The cerebrum consists of the cerebral cortex and 
underlying structures of the telencephalon. A large 
midline sulcus extending from front to back sepa- 
rates the cerebrum into two roughly symmetrical 
structures known as the left cerebral hemisphere 
and the right cerebral hemisphere. In general, the 
cortex of each hemisphere receives sensory input 
from, and sends motor output to, the opposite or 
contralateral side of the body. This principle of contralateral
regulation has only one real exception: the
sense of smell. Olfactory stimulation in each nos- 
tril is processed directly by the same-sided olfac- 
tory bulb sitting just above the nasal cavity. We 
should also mention that the contralateral regula- 
tion of vision is complex: each half of the visual 
field is processed in the opposite cerebral hemisphere.
Furthermore, contralateral regulation is
relative, not absolute. For example, the neural con- 
nections from the right ear extend mainly to the left 
hemisphere, but there are some same-side or ipsi- 
lateral projections as well. 

Within each cerebral hemisphere, several 
prominent landmarks can be used to demarcate four 
major lobes. The occipital lobe is at the rear of the 
brain behind the parieto-occipital sulcus; the pari- 
etal lobe is behind the central sulcus; the temporal 
lobe is beneath the lateral sulcus; and the frontal 
lobe is in front of the central sulcus. The occipital- 
parietal and occipital-temporal boundaries are 
somewhat indistinct (see Figure 9.2). 

The same-named lobes on the two sides of the 
brain are roughly symmetrical in structure and also 
share many functions in common. For the moment 
we will emphasize the operative similarities of the 
two halves of the brain as we review the functions 
of the four lobes. However, we should forewarn the 
reader that the left and right cerebral hemispheres 
also possess specialized functions, discussed later. 
Furthermore, these hemispheric specializations often 
correlate in a sensible way with structural varia- 
tions. For example, certain hemispheric structures are 
larger on the left side of the brain, revealing the lat- 
eralization of language functions to this hemisphere. 


Occipital Lobes 


The primary sensory areas for vision are located in 
the occipital lobes; much of this projection area is 
on the mesial or midline surface that separates the 
two cerebral hemispheres. Each occipital lobe sees 
the opposite side of the visual world. Thus, all vi- 
sual stimuli to the left of the reader’s fixation point 
are ultimately processed in the right occipital lobe, 
and vice versa. The split visual world is shared 
across the splenium, the rearward portion of the
corpus callosum, producing a unified perception of 
the entire visual field. Damage to the primary visual 
area produces a corresponding loss of visual field 
on the opposite side. For example, an extensive 
lesion in the left occipital lobe would render a 
person blind to the right half of the visual world. A 
very small lesion might produce a scotoma or blind 
spot. 

A thorough visual field examination is crucial 
to the detective work of a comprehensive neuropsychological
evaluation. Based on the pattern of
visual field loss, a competent examiner can infer the
location and extent of brain damage. The visual 
system can be likened to transmission cables that 
traverse the brain from the eyes in the front to the 
occipital lobes in the back. Particularly toward the 
rear of the brain, the cables radiate outward to oc- 
cupy significant portions of the subcortical tissue. 
Subcortical lesions toward the rear of the brain 
stand a good chance of disrupting or damaging the 
neural transmission networks. This damage leads
to a loss of the associated visual field (Figure 9.4).
The pattern of visual field loss is a direct clue to the
site of damage within the occipital lobes and sur-
rounding tissue.

FIGURE 9.4 Schematic Diagram of the Visual System and the Effects of Lesions at Various Sites

Source: From Gregory, Robert J. Foundations of intellectual assessment: The WAIS-III and other tests in
clinical practice, p. 71. Published by Allyn and Bacon, Boston, MA. Copyright © 1999 by Pearson Education.
Adapted by permission of the publisher.

The forward portion of each occipital lobe is 
unimodal association cortex. These regions synthe- 
size visual stimuli and produce meaning from them. 
This is where the high-level processing of visual in- 
formation occurs. Damage to the association cortex 
of the occipital lobes may cause visual agnosia, a 
difficulty in the recognition of drawings, objects, or 
faces (Kandel, 1991). Luria (1973) described a typ- 
ical case of a patient with such a lesion: 


The patient carefully examines the picture of a pair 
of spectacles shown to him. He is confused and 
does not know what the picture represents. He 
starts to guess. “There is a circle . . . and another 
circle . . . and a stick . . . a cross-bar . . . why, it 
must be a bicycle?”

The visual agnosias are especially linked to right-
sided lesions of occipital association cortex, but
may also involve impairment of the parietal and
temporal lobes. A particularly dramatic form
of visual agnosia is prosopagnosia, the inability to 
recognize familiar faces. Benson (1994) cites the 
example of a 70-year-old man who suffered a series 
of strokes affecting the forward portions of the oc- 
cipital lobes. The patient’s chief complaint was that 
he could not recognize his wife or his daughter by 
sight, although he immediately recognized them 
by their voices. In another case of visual agnosia 
known as object agnosia, a patient reproduced a 
drawing of a train with great skill, but had no idea 
what he had drawn. Benson (1988) describes the 
many fascinating symptoms of visual agnosia. 


Parietal Lobes 


Kolb and Whishaw (2002) note that the parietal 
lobes are an artifact of gross anatomical definition; 
that is, they do not reflect cytoarchitectural or phys- 
iological unity. A unitary theory of parietal lobe 
function is therefore impossible. In fact, two inde- 
pendent functions of the parietal lobes can be 
identified, one primarily concerned with the so-
matosensory strip located on the postcentral gyrus,
the other primarily concerned with the rearward or 
association zones of the parietal lobes. 

The postcentral gyrus of each parietal lobe 
mediates opposite-sided awareness of what is hap- 
pening on the surface of our bodies. This tissue, 
appropriately referred to as the somatosensory (or 
somesthetic) strip, receives information about 
touch sensations and the position of the limbs (see 
Figure 9.2). The somatosensory strip and the motor 
strip of the frontal lobe are richly interlinked by 
subcortical neural networks; the topical organiza- 
tion of these intertwined functional systems is very 
similar. The amount of somatosensory tissue de- 
voted to each body location corresponds to the 
functional importance of that area. For example, 
feedback from the lips and tongue is essential for 
speech; the amount of somatosensory tissue de- 
voted to these structures is relatively large. Dam- 
age to the somatosensory strip results in loss of 
sensation for the corresponding parts of the body 
on the opposite side. 

The rearward portion of each parietal lobe is 
heteromodal association cortex. This area seems 
specialized for integrating sensory information 
from nearby somatic, visual, and auditory regions. 
A common symptom of damage to this area is in- 
ability to recognize objects by touch, a symptom 
known as astereognosis. Many neuropsychological 
test batteries incorporate measures of astereogno- 
sis, such as recognizing coins (penny, nickel, dime) 
by touch. This is a cross-modal task insofar as the 
patient must form a visual representation from 
touch alone. This region uses its polymodal in- 
formation to provide visuomotor guidance to the 
limbs, hands, and eyes (Clarke, 1994). 

Damage to the parietal association region on ei- 
ther side may impair the ability to draw. This is one 
form of construction dyspraxia, discussed subse- 
quently. Left-sided damage to this area results in an 
impoverished drawing, as if the person has trouble 
getting the drawing hand to go in the correct direc- 
tion. In contrast, right-sided damage to this area re- 
sults in a perceptual deficit. The person has trouble 
integrating the individual parts into a consistent 
whole; the overall gestalt of the drawing is lost. The
association areas of the left and right parietal lobes 
also subserve several lateralized functions, dis- 
cussed later. 


Temporal Lobes 


Four basic functions of the temporal lobes can be 
identified (Kolb & Whishaw, 2002; Bear, 1986). 
These are primary processing of auditory sensa- 
tions, secondary processing of auditory perceptions, 
long-term memory storage, and modulation of bio- 
logical drives (especially aggression, fear, and sexu- 
ality). We discuss each of these functions in turn. 

Each temporal lobe contains a primary auditory 
cortex on its upper portion. Much of the auditory 
cortex is tucked into the lateral sulcus and is not 
visible from a side view of the brain. The left audi- 
tory cortex receives sensory information primarily 
from the right ear and vice versa; however, there is 
some same-sided input as well. A lesion of the pri- 
mary auditory region in one temporal lobe may 
cause a variety of audioperceptual deficits, includ- 
ing inability to detect brief sounds, impaired local- 
ization of sounds, and difficulty discriminating 
speech sounds. These deficits are especially likely 
in the case of damage to the left auditory cortex, as 
discussed later. In rare cases, damage to the audi- 
tory areas can cause deafness even though the au- 
ditory apparatus itself is intact. 

The unimodal auditory association cortex oc- 
cupies much of the external surface of the tempo- 
ral lobe. This associational cortex is important in 
the analysis of complex sounds and rhythmic 
acoustic structures. The auditory association re- 
gions of the left and right temporal lobes are spe- 
cialized for different functions, discussed later. In 
brief, the left side governs language comprehen- 
sion, whereas the right side governs nonverbal au- 
ditory patterns and rhythms. Damage to this region 
in the left temporal lobe causes a disruption of 
language comprehension, while right-sided im- 
pairment causes auditory agnosia, a difficulty in 
recognizing and discriminating nonverbal sounds. 
Albert, Sparks, von Stockert, and Sax (1972) re-
ported the fascinating case of a 57-year-old man
who suffered a discrete stroke in the associational
cortex of the right temporal lobe. This otherwise 
high-functioning gentleman could accurately note 
when a sound was made, but he could not distin- 
guish a telephone ring, a dog bark, the clip-clop of 
horse hooves, whistling, clapping of hands, or 
snapping of fingers. Yet, his perception of language 
sounds (a left temporal lobe function) was intact. 

In conjunction with the hippocampus, the cen- 
termost portions of the temporal cortex help sustain 
long-term memory functions. As discussed later, 
these memory functions are strongly asymmetrical, 
with verbal functions on the left and pictorial 
functions on the right. In brief, lesions of the left 
temporal lobe and/or the underlying hippocampal 
structure cause impairment in delayed recall of ver- 
bal material, such as paragraphs and word lists, 
whether presented visually or orally. Lesions of the 
right temporal lobe and/or the underlying hip- 
pocampal structure cause defects in the delayed 
recall of pictorial material, such as geometric draw- 
ings and faces. Typically, temporal lobe lesions do 
not disturb the immediate recall of verbal or non- 
verbal material. 

Finally, in conjunction with underlying struc- 
tures such as the amygdala, the temporal lobes also 
are involved in biological drives such as aggres- 
sion, fear, and sexuality. Evidence for involvement 
of the temporal lobes in motivation and emotion 
comes from two sources: studies of direct electri- 
cal stimulation of this region, and investigations of 
behavioral alteration in persons with temporal lobe 
seizure disorders. For example, Penfield and Jasper 
(1959) reported that mild electrical stimulation of 
the front and middle temporal cortex produced feel- 
ings of fear, a response also obtained from the 
amygdala. Bear (1986) has catalogued the behav- 
ioral disorders that can result from the motivational 
and emotional dysfunction secondary to temporal 
lobe damage (Table 9.3). These symptoms portray 
the so-called temporal lobe personality, although 
few persons combine all these traits. 

TABLE 9.3 A Compilation of Symptoms that May Occur in Temporal Lobe Dysfunction

Compulsive and indiscriminate hypersexuality

Hyperirritability to trivial slights

Anxiety and phobic responses

Paranoid concerns that generalize widely

Depression and/or agitated euphoric periods

Preoccupation with religion, cosmology, or philosophy

Extensive but unproductive writing, drawing, or lecturing

Preoccupation with details

Circumstantiality, a roundabout loquacious style

Viscosity, a tendency to prolong social encounters

Source: Based on Bear, D. M. (1986). Behavioural changes in
temporal lobe epilepsy. In M. R. Trimble & T. G. Bolwig (Eds.),
Aspects of epilepsy and psychiatry. New York: Wiley.


We close our discussion of the temporal lobes
with a cautionary note. Because the temporal lobes
are rich in subcortical connections to the parietal,
frontal, limbic, and occipital lobes of the brain, le-
sions in this region can cause many diverse symp-
toms not catalogued in the preceding. For example,
temporal lesions may interfere with visual recogni- 
tion. This was first demonstrated by Milner (1958), 
who found that her patients were impaired at rec- 
ognizing visual anomalies (e.g., a monkey in a cage 
with an oil painting on the wall). Kolb, Milner, and 
Taylor (1983) found that patients with right-sided 
temporal lesions failed to perceive that portion of 
a face falling in the left visual field. Kolb and 
Whishaw (1985) describe this interesting symptom: 


Right temporal lobe patients do not appear able
to perceive subtle social signals such as discreet
but obvious glances at one’s watch, a gesture
often intended as a cue to break off a conversation.
Presumably the patients fail to perceive the signifi-
cance of the visual signal.


The interested reader may consult Kolb and Whi- 
shaw (2002) for further discussion of temporal lobe 
dysfunction. 


Frontal Lobes 


The frontal lobes are required for the program- 
ming, regulation, and verification of executive 
functions and motor performance (Luria, 1973; 
Lezak, 1995). Executive functions include goal 
formulation, planning, carrying out goal-directed
plans, and efficient performance. It is with the 
frontal lobes that humans create intentions, form 
plans, and regulate their behavior by comparing the 
effects of their actions with the original intentions. 

Enacting a plan requires a bodily movement of 
some kind. People pursue their goals by physically 
manipulating the environment, whether with their 
hands or through the motor activity of speech. It is 
not surprising, then, to find that the primary motor 
cortex is located in the frontal lobes—where plans 
and intentions are also formed. 

The primary motor cortex is found on the pre- 
central gyrus, at the rear of the frontal lobe, just in 
front of the central sulcus. Motor control is oppo- 
site-sided, with the left motor cortex controlling 
bodily movements on the right, and vice versa. The 
topical organization of the motor strip was first 
mapped by Penfield (1958) during a series of oper- 
ations to remove damaged cortical tissue in persons 
with epilepsy. He stimulated different areas of the 
motor cortex with a harmless electrical current to 
map the correspondence between cortex and dif- 
ferent body parts. Penfield found that those areas of 
the body requiring precise control, such as fingers 
and mouth, occupy a disproportionately large 
amount of cortical space. 

Just in front of the primary motor cortex is the 
supplementary motor cortex. The supplementary 
motor cortex is involved in the serial ordering of 
complex motor chains, that is, movement pro- 
gramming. A portion of the frontal lobes just below 
the supplementary motor cortex is involved in the 
control of voluntary eye gaze. The left frontal lobe 
also mediates expressive language, discussed in de- 
tail later. 

Damage to the primary motor cortex causes op- 
posite-sided deficits in fine motor control and also 
reduces the speed and strength of limb movements. 
These effects are easily detected with simple motor 
tests such as finger-tapping speed. Severe damage 
to the motor cortex causes total paralysis of the af- 
fected bodily parts. Damage to the supplementary 
motor cortex causes deficits in the execution of 
motor sequences such as copying a series of arm or 
facial movements (Kolb & Milner, 1981). 

The most common cause of frontal lobe dam-
age is closed head injury, which is one type of
traumatic brain injury. (Traumatic brain injury and 
other prominent causes of adult neuropathology, 
including age-related syndromes, are outlined in 
Table 9.4.) In a closed head injury, acceleration/de- 
celeration forces are instantly applied to the entire 
brain, as when a person’s head strikes the dash- 
board in an automobile accident. Because of the 
irregular surfaces of the surrounding skull, the for- 
ward underside surfaces of the frontal lobes are al- 
most always damaged (Jennett & Teasdale, 1981). 
The front ends of the temporal lobes also are highly 
vulnerable in closed head injury. 

Nauta (1971) summarizes the effects of frontal 
lobe dysfunction as a “derangement of behavioral 
programming.” Lezak (1983, 1995) has catalogued 
the behavioral disturbances that can result from 
generalized, bilateral frontal lobe damage: 


1. Motivational-like problems involving decreased 
spontaneity, decreased productivity, reduced 
rate of behavior, and lack of initiative 

2. Difficulties in making mental shifts and perse- 
veration of activities and responses 

3. Problems in stopping that are often described as 
impulsivity, overreactivity, and difficulty in 
holding back a wrong or unwanted response 

4. Deficits in self-awareness resulting in an inabil- 
ity to perceive performance errors or to size up 
social situations appropriately 

5. A concrete attitude (Goldstein, 1944) in which 
objects, experiences, and behavior are all taken 
at their most obvious face value 


Curiously, frontal lobe lesions may have little 
effect on old learning and well-established skills. 
Both Hebb and Penfield reported that surgical re- 
moval of frontal lobe tissue caused little change in 
IQ scores (Hebb, 1939; Penfield & Evans, 1935). 
Early studies of prefrontal lobotomy demonstrated 
much the same finding: no change in IQ or even a 
slight improvement after dysconnection of the 
frontal lobes. 

Devising adequate measures of frontal lobe 
function has proved to be difficult. Lezak (1995)
notes that frontal lobe disorders change how a per-
son responds, whereas most tests measure what a
person knows. She has devised an ingenious 
method called the Tinkertoy® Test, discussed in the 
next topic, to assess the programming difficulties 
experienced by persons with frontal lobe lesions. 
More commonly, clinicians rely upon observation 
and checklists to diagnose frontal lobe dysfunction. 
A useful instrument for this purpose is the Check- 
list of Executive Functions (Pollens, McBratnie, & 
Burton, 1988). The items on this checklist define 
important aspects of frontal lobe functions: aware- 
ness, goal setting, planning, self-initiation, self- 
inhibition, self-monitoring, ability to change set, 
and strategic behavior. Each item is rated on a scale 
of 1 to 5 so that the total score provides an index of 
the integrity of executive functions. 
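
The scoring rule just described, a 1-to-5 rating per item summed to a total, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published instrument: it assumes exactly one item per aspect named above (eight in all), and the names `ITEMS` and `total_score` are hypothetical. The text does not say which end of the scale indicates better functioning, so the sketch makes no claim about interpretation.

```python
# Illustrative sketch; assumes one item per aspect listed in the text.
ITEMS = (
    "awareness",
    "goal setting",
    "planning",
    "self-initiation",
    "self-inhibition",
    "self-monitoring",
    "ability to change set",
    "strategic behavior",
)


def total_score(ratings: dict[str, int]) -> int:
    """Sum the 1-to-5 ratings after checking that every item is rated validly."""
    missing = [item for item in ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"unrated items: {missing}")
    for item in ITEMS:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    return sum(ratings[item] for item in ITEMS)


example = {item: 3 for item in ITEMS}  # hypothetical ratings, not real data
print(total_score(example))
```

Under the eight-item assumption the total can range from 8 to 40.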


CEREBRAL LATERALIZATION OF FUNCTION


Up to this point, we have stressed the similarities of 
the two halves of the brain, particularly with respect 
to opposite-sided processing of sensory and motor 
functions. In many important respects, however, the 
two hemispheres of the human brain are anatomi- 
cally and functionally asymmetrical (Cytowic, 
1996; Geschwind & Galaburda, 1987). The impor- 
tant structural asymmetries include the following: 


1. The top of the temporal lobe, the planum tem- 
porale, is larger in the left hemisphere. 

2. The left hemisphere contains more gray matter, 
even though it is slightly smaller and lighter than 
the right hemisphere. 

3. The lateral sulcus is much longer in the right 
hemisphere with the result that the parietal-tem- 
poral cortex is slightly enlarged on this side. 


Of course, exceptions will occur in individual 
cases. The differences previously listed are most 
consistent and pronounced in right-handers and 
males. Left-handers and females show fewer struc- 
tural and functional asymmetries of the brain 
(Weekes, 1994). 




TABLE 9.4 Major Neuropathological Conditions of Adulthood and Aging: Brief Synopsis and Essential References





Traumatic Brain Injury 

(Bigler, 1990; Dikmen et al., 1995) 
Description: Neurological consequences depend upon 
the severity of the injury, but all of the following are 
possible: contusions or bruising of the brain under- 
neath the site of impact (coup injury); opposite-sided 
contusions (contrecoup injury); frequent contusions in 
the undersurfaces of the frontal lobes and the tips of 
the temporal lobes; diffuse axonal injury or nonspecific 
damage from shear-strain effects on neural pathways; 
brain tissue damage due to obstructed blood flow; 
hematoma or blood clot between the skull and the sur- 
face of the brain; edema or swelling of the brain; long- 
term consequences include possible shrinkage of the 
brain and corresponding enlargement of the ventricular 
system. 
Potential Neurobehavioral Effects: The most common, 
and reliable, complaints are of concentration and mem- 
ory problems; other generalizations are difficult be- 
cause the nature and severity of the brain damage will 
not be the same in any two patients; focal damage may 
lead to specific symptoms; for example, damage to the 
left hemisphere language areas may cause expressive 
aphasia; many studies suggest that traumatically brain- 
injured patients are more seriously handicapped by 
personality and emotional disturbances than by cogni- 
tive and physical disabilities (Lezak & O’Brien, 1990). 


Neoplastic Disease (Tumor) 

(Reitan & Wolfson, 1993) 
Description: Neoplastic disease encompasses many 
different forms of tumorous growth; for example, 
gliomas are tendril-like tumors of the glial cells that 
infiltrate the brain over a period of weeks or months; 
meningiomas are slower-growing, globular-shaped 
tumors of the meninges (membranes encasing the 
brain) that press down upon the brain. 
Potential Neurobehavioral Effects: Brain tumors 
produce a variety of effects, depending upon their 
location and size; rapidly infiltrating tumors may 
compromise many skills, for example, language and 
problem-solving abilities, motor and sensory functions
on the right side, if the left hemisphere is affected; 
slower-growing meningiomas may lead to focal symp- 
toms that relate to the site of encroachment on the 
brain, for example, deficits in spatial ability and im- 
pairment of motor and sensory functions on the left 
side if the right parietofrontal area is affected. 


Chronic Alcohol Abuse 

(Davila et al., 1994) 
Description: Chronic alcohol ingestion leads to neu- 
ron changes that include a loss of dendritic branches 
and dendritic spines, especially in the hippocampus 
and dentate gyrus; enlargement of the ventricles and 
widening of the cerebral sulci is also observed; in 
severe cases, atrophy of the medial thalamus and 
mamillary bodies is found (Wernicke-Korsakoff’s 
syndrome); the neuropathology of alcoholism 
may be exacerbated by vitamin and nutritional 
deficiencies. 
Potential Neurobehavioral Effects: In cases of severe 
alcohol abuse in which the medial thalamus and 
mamillary bodies are compromised, the profound an- 
terograde amnesia of Wernicke-Korsakoff’s syndrome 
is noted; patients show an inability to retain memory 
of events for more than a short time even though im- 
mediate memory is intact and remote memory is only 
mildly impaired; confabulation or falsification of mem- 
ory with clear consciousness is noted; other symptoms 
of severe abuse include gait disturbance and gaze 
difficulties; in neurologically intact alcoholics, neuro- 
behavioral effects are more elusive, but may include 
subtle memory deficits and difficulties with novel 
problem solving. 


Alzheimer’s Disease 

(Knight, 1992; Koss, 1994) 
Description: The most common degenerative neuro- 
logical disease is Alzheimer’s disease (AD), which 
features an insidious degeneration of the brain; the 
pathophysiology includes clumplike deposits in the 
brain consisting of neuritic plaques and neurofibrillary 
tangles; additional brain changes include neuronal loss, 
shrinkage or atrophy of the brain, depletion of acetyl- 
choline neurotransmitters involved in memory, and 
accumulation of foreign deposits in the cerebral vascu- 
lature; the course of the disease is invariably downhill. 
In 1907 Alois Alzheimer portrayed his initial case as 
follows: 


The first noticeable symptom of illness shown by 
this 51-year-old woman was suspiciousness of her 
husband. Soon, a rapidly increasing memory im- 
pairment became evident; she could no longer ori- 
ent herself in her own dwelling, dragged objects 
here and there and hid them, and at times, believing 
that people were out to murder her, started to
scream loudly. On observation at the institution, her
entire demeanor bears the stamp of utter bewilder- 
ment. She is completely disoriented to time and 
place. (La Rue, 1992) 


Although Alzheimer’s disease is not part of normal 
aging, advanced age is an important risk factor; rare 
before age 65, the disease afflicts 3 percent of
persons 65 to 74 years of age, 18 percent of persons
75 to 84 years of age, and nearly half of those 85 
years and older (Evans, Funkenstein, Albert, and 
others, 1989). 

Potential Neurobehavioral Effects: As detailed by 
Storandt and Hill (1989), difficulty with the acquisition 
of new information (short-term memory dysfunction) 
is generally the most salient symptom in the early 
stages; patients may also show a prominent language 
dysfunction (e.g., pronounced word-finding difficulty) 
or a striking visuospatial disturbance; reports of per- 
sonality change, including delusions and agitation, are 
also common; the late stages are characterized by 
severe, pervasive disability. 


Vascular Dementia 

(Mirsen & Hachinski, 1988) 
Description: The second most common cause of de- 
mentia in the elderly, vascular dementia is caused by 
blockage of an artery and subsequent death of brain 
tissue because of insufficient blood supply (infarction), 
or bleeding into or around the brain (hemorrhage); 
sudden onset is the rule, but the accumulation of 
small strokes over time, known as multi-infarct 
dementia (MID), may produce an apparently progres- 
sive disorder. The Hachinski Ischemic Score was 
developed to distinguish multi-infarct dementia from 
Alzheimer’s disease (Hachinski, Iliff, Zilhka, and
others, 1975). MID is indicated by the presence of 
several of the following factors: abrupt onset, somatic 
complaints, stepwise deterioration, emotional inconti- 
nence, fluctuating course, history of hypertension, 
nocturnal confusion, history of strokes, personality 
preserved, atherosclerosis present, depression, and 
focal neurological signs. 
Because MID may be partially treatable, the differen- 
tial diagnosis of MID versus AD is more than acade- 
mic; the course of the illness in MID is shorter than in 
AD, but more variable from person to person. 
Potential Neurobehavioral Effects: The stroke syn- 
drome is defined by the acute onset of a focal deficit 
involving the central nervous system; symptoms de-
pend upon the site of infarction but may include motor
weakness and impaired sensibility in the opposite
limbs; nonfluent aphasia may result if the dominant 
hemisphere is affected; stroke in the rear of the brain 
may produce partial loss of the visual field; acute 
symptoms may partially abate and lead to a plateau of 
stable functioning. 


Parkinson’s Disease 

(La Rue, 1992) 
Description: Parkinson’s disease (PD) is almost 
nonexistent before age 40 and affects only 1 or 2 
in 1,000 persons ages 70 and over; primarily a 
movement disorder, but cognitive and emotional 
problems are common; late stages of PD may involve 
a clear dementia; symptoms include slowness of 
movement (bradykinesia), tremor at rest, shuffling 
gait, and postural rigidity; neuropathology involves 
depletion of dopamine and neuron loss in the basal 
ganglia. 
Potential Neurobehavioral Effects: Tremor is the most 
common and the least debilitating early symptom; the 
rate of progression is quite variable, but movement 
disability in PD can become pronounced and lead to 
confinement; 10 to 20 percent of PD patients develop 
a clear dementia; PD patients reveal a deficit on 
neuropsychological tests requiring speed (e.g., Digit 
Symbol, Trail Making, reaction time measures); sur- 
prisingly, tests of visual discrimination and paired- 
associate learning that do not require speed also 
differentiate patients with moderate to severe PD from 
matched controls (Pirozzolo, Hansch, Mortimer, Web- 
ster, & Kuskowski, 1982); about 40 to 60 percent of 
PD patients also experience depression. 


Dementia Syndrome of Depression 

(Blazer, 1993) 
Description: About 10 to 20 percent of depressed 
elderly show cognitive deficits that mimic organic 
dementia; memory loss in combination with severe 
complaints of disability are common features; Demen- 
tia Syndrome of Depression (also known as pseudo- 
dementia) is mainly a post hoc diagnosis defined by 
a return to normal cognitive performance after severe 
depression is treated. 


Potential Neurobehavioral Effects: Memory loss for 
both recent and remote events is common; attention 
and concentration are preserved; social skills may 
show prominent early losses; marked variability of per- 
formance on cognitive tests, even across similar tasks, 
is noted. 





The structural asymmetries between the hemi- 
spheres correlate with the well-known functional 
differences between the two sides of the brain. Lan- 
guage is subserved, in part, by the temporal en- 
largements on the left side, whereas spatial thinking 
is subserved, in part, by the parietal-temporal en- 
largements on the right side. In the remainder of 
this section we will further catalogue the special- 
ties of the left and right hemispheres, forewarning 
the reader that lateralization of function is rela- 
tive, not absolute. For example, both hemispheres 
have some degree of verbal and spatial capacity. 
Furthermore, virtually any high-level intellectual 
activity requires the synthetic interaction of the en- 
tire brain (Efron, 1990). Speech is a case in point. 
While speech is primarily a left hemisphere func- 
tion, the right cerebral hemisphere does provide the 
intonation patterns. As a result, patients with right- 
sided lesions (particularly in the frontal area) may 
speak in an eerie monotone (Gardner, 1975). 


Language Functions of the Left Hemisphere 


Language is primarily (but not exclusively) a left 
hemisphere function that involves widely separated 
cortical and subcortical structures. Because so 
many regions of the left hemisphere are involved in 
language, virtually any significant left hemisphere 
lesion will produce some kind of disturbance in the 
production or comprehension of language. For this 
reason a detailed profile of language skills offers a 
window to the integrity and functioning of the left 
hemisphere. 

Modern conceptions of brain-language corre-
lations actually stem from the second half of the
nineteenth century. In 1861, Paul Broca observed that damage to
a small region just in front of the motor cortex of 
the left hemisphere caused a language disorder 
originally called expressive aphasia and now more 
typically known as nonfluent aphasia. Persons with 
damage to this left hemisphere premotor area— 
aptly named Broca’s area—speak in a slow, labored 
manner. They have difficulty enunciating words 
correctly; the act of speaking seems to be torturous 
for them. Speech takes on a frankly telegrammatic 
nature; adjectives, adverbs, articles, and conjunc-
tions—the words that add color to speech—fre-
quently are omitted. Writing also is difficult for
these persons. Fortunately, persons who experience 
Broca’s aphasia have little difficulty understanding 
either spoken or written language. In its pure form, 
the disorder involves expressive language only. 

In 1874, Wernicke announced that damage to 
the upper and rearward portion of the left temporal 
lobe—a region now known as Wernicke’s area— 
was linked to a language disorder originally called 
receptive aphasia and now more typically known as 
fluent aphasia. Affected individuals appear unable 
to comprehend spoken or written language. Appar- 
ently, persons with Wernicke’s aphasia have no dif- 
ficulty perceiving words, but cannot associate the 
words with their underlying meaning. As a conse- 
quence, the written and verbal expressions of per- 
sons with this aphasia are fluent but meaningless. 
For example, when asked to define book, a patient 
might respond, “Book, a husbelt, a king of prepa- 
tor, find it in front of a car ready to be directed.” 
The same person might define scarecrow as, “We’ll 
call that a three-minute resk witch, you'll find one 
in the country in three witches” (Williams, 1979). 

Building on the observations of Broca and Wer- 
nicke, Geschwind (1972) proposed a structural, 
neurological model of left hemisphere language 
functions that has been highly influential in neu- 
ropsychological assessment. This model bears di- 
rectly upon the assessment of language skills; the 
major elements are outlined below and depicted in 
Figure 9.5. Geschwind postulated the following: 


1. Spoken language is perceived in the left audi- 
tory cortex at the top of the temporal lobe, then 
transferred to Wernicke’s area. 

2. In Wernicke’s area, the meanings of words are 
activated and the auditory codes are transported 
to a subcortical bundle of transmission fibers 
called the arcuate fasciculus. 

3. The arcuate fasciculus sends the auditory codes 
directly to Broca’s area. 

4. Upon reaching Broca’s area, the auditory code 
activates the corresponding articulatory code 
that specifies the sequence of muscle actions re- 
quired to pronounce a word. 
5. In turn, the articulatory code is transmitted to the
portions of the motor cortex governing tongue,
lips, larynx, and so forth in order to produce the
desired spoken word.

FIGURE 9.5 The Structural Model of Left Hemisphere Language Functions (figure labels: Broca’s area, arcuate fasciculus [subcortical], angular gyrus, visual cortex)


Comprehending or speaking a written word in- 
volves most of the previously outlined pathways, 
but with a different starting point: 


6. Written words are first registered in the visual 
cortex, then relayed through the visual associa- 
tion cortex to the angular gyrus. 

7. In the angular gyrus, the visual form of the word
is mapped into the auditory code stored in Wer- 
nicke’s area, thereby gaining access to the mean- 
ing of the written word, which can also be 
spoken (steps 2 through 5 previously). 


The Geschwind model is helpful in explain- 
ing a number of clinical syndromes caused by 
discrete left hemisphere brain damage (Gregory, 
1999):


• Lesions to Broca’s area will cause slow, labored,
telegraphic speech, but the comprehension of 
spoken or written language will not be affected. 

• Damage to Wernicke’s area will have more
serious and pervasive implications for lan- 
guage comprehension; namely, the patient will 
be unable to understand spoken or written 
communications. 



• Damage to the angular gyrus will cause serious
reading disability, but there will be little problem
in comprehending speech or in speaking.

• Impairment limited to the left auditory cortex
will result in serious disruption of verbal comprehension.
However, such persons will be able
to speak and read normally.
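
The logic of these lesion predictions can be sketched in a few lines of code. The following is only an illustrative simplification, not part of the Geschwind model itself: each language function is reduced to the ordered list of structures it depends on, and a lesion is assumed to impair every function whose route passes through the damaged site. The names `PATHWAYS` and `impaired_functions` are hypothetical, and the sketch does not capture finer points such as the fluent but meaningless speech seen in Wernicke's aphasia.

```python
# Simplified routes, loosely following steps 1-7 of the classical model.
# Assumption: a function fails whenever any structure on its route is damaged.
PATHWAYS = {
    "comprehend speech": ["auditory cortex", "Wernicke's area"],
    "comprehend written words": [
        "visual cortex",
        "visual association cortex",
        "angular gyrus",
        "Wernicke's area",
    ],
    "articulate speech": ["Broca's area", "motor cortex"],
}


def impaired_functions(lesion_site):
    """Return the functions whose route includes the damaged structure."""
    return sorted(
        function for function, route in PATHWAYS.items() if lesion_site in route
    )


for site in ["Broca's area", "Wernicke's area", "angular gyrus", "auditory cortex"]:
    print(site, "->", impaired_functions(site))
```

Under these assumptions the sketch reproduces the four syndromes listed above: a Broca's area lesion spares comprehension, a Wernicke's area lesion disrupts comprehension of both spoken and written language, an angular gyrus lesion affects reading only, and an auditory cortex lesion affects speech comprehension only.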


In practice, few patients reveal aphasic symp- 
toms that fall neatly into one or another of the 
preceding categories. Furthermore, modern concep- 
tions of aphasia point to weaknesses in the classical 
model (e.g., its overly simplistic view of the struc- 
ture of language) and propose a complex, nonlinear 
model of aphasia that is beyond the scope of cover- 
age here (Cytowic, 1996; Whitaker & Kahn, 1994). 
Nonetheless, a thorough assessment of language 
functions is an essential part of every neuropsycho- 
logical evaluation and the classical model of Broca, 
Wernicke, and Geschwind provides a useful starting 
point. Additional perspectives on aphasia and the 
structural model of language can be found in Ben- 
son (1994) and Mayeux and Kandel (1991). 


SPECIALIZED FUNCTIONS 
OF THE RIGHT HEMISPHERE 


Based on thousands of studies of normal and brain-damaged
persons, it is now well established that the right hemisphere
is dominant for a variety of cognitive and perceptual skills.
However, a detailed
discussion of specialized right hemisphere func- 
tions is beyond the scope of this section. Compe- 
tent reviews of the extensive literature on this topic 
can be found in Bradshaw and Mattingley (1995), 
Joseph (1988), Kupfermann (1991b), Dean (1986), 
and Springer and Deutsch (1997). In general, the 
right hemisphere appears to be dominant for the 
analysis of geometric and visual space, the com- 
prehension and expression of emotion, the pro- 
cessing of music and nonverbal environmental 
sounds, the production of nonverbal and spatial 
memories, and the tactual recognition of complex 
shapes. 

A frequent symptom of right hemisphere dam- 
age is constructional dyspraxia, the impaired 
ability to deal with spatial relationships either in a 
two- or three-dimensional framework (Reitan & 
Wolfson, 1993). This symptom is commonly exhibited
by an impaired ability to copy simple
shapes such as a cross. Left hemisphere lesions can 
also cause constructional dyspraxia, but the corre- 
lation is less consistent. Most neuropsychological 
test batteries include one or more copying tasks to 
screen for constructional dyspraxia. We include a 
summary of findings on cerebral lateralization in 
Table 9.5. 


CLINICAL TESTS AND
BRAIN-IMAGING TECHNIQUES


Several clinical tests and brain-imaging techniques 
have been invented to assist in neurological diag- 
nosis (Cohen & Bookheimer, 1994; Kandel, 
Schwartz, & Jessell, 1995; Raichle, 1994). Because 
neuropsychological testing often goes hand in hand 
with these medical procedures, we provide a brief 
introduction to the most widely used clinical tests 




TABLE 9.5 A Summary of Findings on Cerebral Lateralization

Functional System   Left Hemisphere Dominance                       Right Hemisphere Dominance

Vision              Processing of the right visual field            Processing of the left visual field
                    Recognition of letters, words                   Recognition of faces

Audition            Processing of right ear                         Processing of left ear
                    Processing of language-related sounds           Processing of music and environmental sounds

Somatosensory       Sensory input from the right side               Sensory input from the left side

Movement            Motor output to the right side                  Motor output to the left side
                    Complex voluntary movement, including speech

Language            Speech, reading, writing, and arithmetic        Intonation and emotional patterning to speech

Memory              Verbal memory                                   Pictorial memory

Spatial processes                                                   Analysis of geometric and visual space

Emotion                                                             Comprehension and expression of emotion

Olfaction           Smell in left nostril                           Smell in right nostril

TABLE 9.6 Clinical Tests and Brain-Imaging Techniques Useful in Neurological Diagnosis





Electroencephalography (EEG) 

The EEG produces a record of electrical activity of the 
cortex from electrodes posted on specific areas of the 
skull. The fluctuations in activity are depicted as sepa- 
rate ink lines drawn on a continuous roll of paper. The 
EEG is a crude index, since the moment-to-moment 
fluctuations of each ink line reflect the synchronized 
activity of millions of neurons. The test is useful in 
diagnosing seizure disorders and localizing abnormal 
brain activity such as caused by a tumor. 


Cerebral Angiography 

In this technique a special radio-opaque dye is injected 
into a major artery that supplies the brain (vertebral or 
carotid artery) and then the brain is X-rayed. Because 
the dye blocks the X rays, the arterial system of the 
brain stands out in stark relief on the negative. The 
physician therefore can locate vascular anomalies such 
as an aneurysm (a dangerous ballooning of an artery). 
Also, if an artery is displaced from its normal location,
the specialist can infer underlying pathology such as
a tumor. Traditional angiography presents a slight risk
to the patient, because the injection of the dye can
cause neurological complications. With continued
advances in technique, magnetic resonance angiography
will likely supplant traditional approaches.


Computerized Tomography (CT) 

In a CT scan, a narrow beam of X rays is passed 
through the brain from dozens of different angles; the 
machine detects the amount of X rays emerging from 
the other side. The density of different internal struc- 
tures appears on X ray films in inverse proportion to 
their absorption of X rays. A computer works out the 
mathematics to reconstruct a three-dimensional repre- 
sentation of internal brain densities, thereby revealing 
important structures. The computer prints eight or so 
two-dimensional cross-sectional X rays of the brain, 
each from a different plane. CT produces a resolution 
of less than 1 millimeter; tumors, blood clots, and ven- 
tricular displacements are easily seen. CT is less harm- 
ful than a traditional chest X ray. 


Positron Emission Tomography (PET) 

In a PET scan, the patient is injected with a radioactively
tagged form of glucose, an essential metabolic
fuel used by the brain. The level of radioactivity
is extremely scant and not considered harmful. The
radioactivity is then monitored by a special detector
surrounding the patient’s skull. Because the glucose
goes to the most active parts of the brain, a PET
scan measures activity level and not structure per se.
Thus, a PET scan can be used to gauge regional
cerebral activity, which can be helpful in the diagnosis
of Alzheimer’s disease, schizophrenia, and
other brain-impairing conditions. PET scans also
can be used to map receptor sites by having the patient
inhale a radioactive gas which binds to specific
receptor sites such as the dopamine receptors of
the basal ganglia. The major drawback to PET is
the level of technology required. Some applications
require a nearby cyclotron for creation of short-lived
isotopes.


Magnetic Resonance Imaging (MRI) 

The functional principle of MRI is that certain atoms
such as hydrogen behave like tiny spinning magnets.
When placed in a strong magnetic field, these atoms
will line up with one another. When radio waves
are then beamed across the atoms at right angles to
the magnetic field, the atoms wobble synchronously
with one another. As the radio waves are turned off,
a wire coil surrounding the skull will detect a voltage
or magnetic resonance in the magnetic field, the
voltage being stronger in areas that contain higher
concentrations of hydrogen. Because many parts
of the brain are hydrogen-rich (especially those that
contain water, or H₂O), the differential pattern of
magnetic resonance helps reveal underlying brain
structures. The spatial resolution of MRI is so keen
that the images resemble fixed and sectioned anatomical
material. A new procedure known as magnetic
resonance angiography promises to replace the more
dangerous and invasive procedures of traditional
cerebral angiography.





and brain-imaging techniques (Table 9.6). These
procedures are authorized and used exclusively by
neurologists and other medical practitioners who
specialize in diseases of the nervous system. In addition
to an office examination that focuses upon
the patient’s history, mental state, neural reflexes,
sensory functioning, and motor skill, a neurologist
commonly uses one or more clinical tests to help
diagnose or rule out neurological disease. Clinical
tests are essential in arriving at a correct diagnosis
of the patient’s medical condition. However, they
do not replace neuropsychological testing, which is
needed to illuminate the functional consequences
of neurological conditions.


APPLICATIONS OF 
NEUROPSYCHOLOGICAL TEST 
FINDINGS 


Now that we have reviewed the essentials of human 
brain structure and function, the reader should be in 
a better position to comprehend the neurological 
meaning of psychological test results. But there is 
more to neuropsychological assessment than mere 
understanding of brain function. We concur with 
Kolb and Whishaw (2002) that the value of neu- 
ropsychology lies not only in its contribution to an 
understanding of brain function, but also in the 
applicability of this basic knowledge to human 
problems. To illustrate these interlocking points— 
that neuropsychologically based testing leads to 
understanding and that understanding leads to ap- 
plication—we return to the case of the failing pre- 
medical student. 

The reader may wish to examine the test results 
for the failing premedical student introduced at the 
beginning of this chapter (Case Exhibit 9.1) in light 
of the knowledge gained from the preceding pages. 
By way of quick review, the student was failing 
courses such as organic chemistry and embryology 
in spite of his superior Full Scale IQ of 122. In addition
to his suspiciously low score on a test of abstract
thinking (Category Test), the student could not
accurately copy a Greek cross, and his finger-tapping 
speed for the left hand was comparatively quite slow. 
The reader may recognize the first symptom—diffi- 
culty copying a shape—as constructional dyspraxia, 
which often indicates right hemisphere impairment. 
Because constructional dyspraxia signals a severe 
weakness in dealing with spatial relationships, this 
student’s difficulty with organic chemistry and 
embryology was not surprising. Furthermore, the 
second symptom—motor slowing in the left hand— 
also suggests right hemisphere impairment. The 
nondominant hand should be about 10 percent 
slower than the dominant hand (Reitan & Wolfson, 
1993). In this instance, the nondominant hand is 25 
percent slower than the dominant hand (56 versus 
42). Although the cause remains a mystery, a CT 
scan confirmed that the premedical student had in- 
curred a static lesion in the frontal-parietal region of 
the right hemisphere. Of course, this fact could have 
been revealed by use of the CT scan alone, without 
reference to neuropsychological test results. None- 
theless, the test battery served a useful purpose by 
documenting the functional consequences of brain 
damage. Incidentally, the student switched majors to 
history—an academic pursuit more compatible with 
his left hemisphere strengths—and graduated with a 
degree in education. 
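
The arithmetic behind the finger-tapping comparison can be made explicit. The short sketch below is offered only as an illustration: it computes how much slower the nondominant hand is, as a percentage of the dominant-hand score, using the two scores reported for this case (56 and 42). The function name `percent_slower` and the constant `EXPECTED_SLOWING` are hypothetical labels; the 10 percent figure is the expectation cited from Reitan and Wolfson (1993).

```python
EXPECTED_SLOWING = 10.0  # approximate expected nondominant-hand slowing, in percent


def percent_slower(dominant_taps, nondominant_taps):
    """Percentage by which the nondominant hand trails the dominant hand."""
    if dominant_taps <= 0:
        raise ValueError("dominant-hand score must be positive")
    return 100.0 * (dominant_taps - nondominant_taps) / dominant_taps


observed = percent_slower(56, 42)  # scores from the case: 56 versus 42
print(f"Observed slowing: {observed:.0f} percent")
print(f"Excess over expectation: {observed - EXPECTED_SLOWING:.0f} percentage points")
```

Because (56 - 42) / 56 = 0.25, the nondominant hand is 25 percent slower, well beyond the expected difference of about 10 percent.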


SUMMARY


1. Neuropsychology is the study of the rela- 
tionship between brain function and behavior. In 
neuropsychological assessment, tests sensitive to
brain dysfunction are used for individual assess- 
ment. 


2. The central nervous system consists of the 
brain and spinal cord. The peripheral nervous sys- 
tem includes the cranial nerves and the network of 
nerves emanating from the spinal cord. The brain 
consists of 10¹¹ neurons and an even larger number
of glial cells which provide structural support. 


3. The lowest part of the brain is the hind- 
brain, which contains the medulla oblongata. Nerve 
fibers from higher parts of the brain pass through 
here and cross over on their way to the spinal 
cord. The medulla helps control swallowing, vom- 
iting, breathing, blood pressure, respiration, and 
heart rate. 


4. The reticular formation passes through the 
medulla. Specific nuclei within the reticular for- 
mation project to wide areas of the brain and 
thereby help mediate complex postural reflexes and 
muscle tone. Portions of this structure known as the
reticular activating system are known to govern 
general arousal or consciousness. 


5. Located within the midbrain, the tegmen- 
tum is a relay station for sensory and motor fibers 
and also contains many of the cranial nerves. The 12 
paired cranial nerves help govern the major senses 
(smell, vision, taste, hearing) and are also involved 
in the movement of the face and tongue. 


6. The structures of the diencephalon in- 
clude the thalamus, hypothalamus, and pineal 
body. The function of the pineal gland is obscure, 
although it is known to help regulate cyclic bio- 
logic rhythms through release of the hormone 
melatonin. 


7. The thalamus is a key structure that pro- 
vides sensory input and information about ongoing 
movement to the cerebral cortex. The thalamus 
may also be involved in memory, attention, speech, 
and emotional experience, either directly or indi- 
rectly through its role as a relay station to distant 
cerebral sites. 


8. The hypothalamus is involved in numer- 
ous aspects of motivated behavior, including 
feeding, sexual behavior, sleeping, temperature reg- 
ulation, emotional behavior, and movement. The hy- 
pothalamus also controls the pituitary gland, thereby 
modulating a wide range of endocrine functions. 


9. The outermost formation of the forebrain 
is the telencephalon, which includes the limbic 
lobe, basal ganglia, corpus callosum, and cerebral 
cortex. The limbic lobe, including the hippocam- 
pus, septum, and cingulate gyrus, is involved in the 
regulation of emotion. The hippocampus plays a 
crucial role in the consolidation of new experiences 
into long-term memories. 


10. The basal ganglia, including the caudate, 
putamen, and globus pallidus, help govern move- 
ment. Parkinson’s disease is a degenerative disorder 
of the basal ganglia characterized by involuntary 
movement, including tremor, poverty and slowness 
of movement without paralysis, and changes in 
posture and muscle tone. 


11. The corpus callosum is the major commis- 
sure that serves to integrate the functions of the two 
hemispheres. Much has been learned about brain 
function by studying persons with epilepsy in 
whom the corpus callosum has been surgically sev- 
ered for therapeutic purposes. 


12. The outermost layer of the brain is the cere- 
bral cortex or neocortex. Committed cerebral cor- 
tex consists of dedicated sites for basic sensory 
processing of vision, hearing, touch, and motor 
control. But the majority of the cerebral cortex is 
uncommitted or association cortex, used in the 
analysis of sensory information and the formula- 
tion of motor responses. 


13. The frontal lobes are involved in program- 
ming and regulation of goal planning, including 
bodily movement; the parietal lobes mediate oppo- 
site-sided bodily awareness and help with spatial 
tasks; the occipital lobes are involved in vision; and 
the temporal lobes are involved in auditory pro- 
cessing, memory, and biological drives such as ag- 
gression, fear, and sexuality. 


14. In most persons, the left cerebral hemi- 
sphere is dominant for language. Expressive com- 
ponents of language are subserved by Broca’s area 
in the frontal lobe, while receptive components of 
language are mediated by Wernicke’s area in the 
temporal lobe. 


15. Specialized functions of the right cerebral 
hemisphere include the analysis of geometric and 
visual space, the comprehension and expression of 
emotion, the processing of music and nonverbal en- 
vironmental sounds, the production of nonverbal 
memories, and the tactual recognition of complex 
shapes. 


16. Several clinical tests and brain-imaging 
techniques can be used to help diagnose neuro- 
logical diseases. These include EEG, cerebral 
arteriography, computerized tomography, positron 
emission tomography, magnetic resonance imag- 
ing, and magnetic resonance angiography. 


KEY TERMS AND CONCEPTS 


neuropsychology p. 304 
ventricles p. 306 
hindbrain p. 307 

medulla oblongata p. 307 
reticular formation p. 307 
cerebellum p. 307 
dysarthria p. 308 
midbrain p. 308 

cranial nerves p. 308 
forebrain p. 309 

pineal body p. 309 
thalamus p. 309 
hypothalamus p. 309 




limbic lobe p. 310
hippocampus p. 310
basal ganglia p. 310
Parkinson’s disease p. 310
corpus callosum p. 311
cerebral cortex p. 311
occipital lobes p. 313
visual agnosia p. 315
parietal lobes p. 315
temporal lobes p. 316
frontal lobes p. 317


constructional dyspraxia p. 323 


Topic 9B Neuropsychological and 
Geriatric Assessment 


A Conceptual Model of Brain-Behavior Relationships 
Assessment of Sensory Input 
Measures of Attention and Concentration 
Tests of Learning and Memory 
Assessment of Language Functions 

Tests of Spatial and Manipulatory Ability 
Assessment of Executive Functions 
Assessment of Motor Output 
Test Batteries in Neuropsychological Assessment 

Case Exhibit 9.2  Luria-Nebraska Neuropsychological Battery 
Assessment of Mental Status in the Elderly 


Summary 
Key Terms and Concepts 


Neuropsychological tests and procedures encompass
an eclectic assortment of methods
and purposes. At one end of the spectrum are sim- 
ple, 10-minute screening tests used to probe the 
need for further assessment. At the other end of the 
spectrum are exhaustive, six-hour test batteries de- 
signed to provide a comprehensive assessment. In 
between are hundreds of specialized instruments 
developed to measure particular neuropsychologi- 
cal abilities. At first glance, this multitude of tests 
would appear to resist simple categorization, as if 
researchers in this area had followed an incoherent 
philosophy of trial and error in the development of 
new instruments and procedures. However, with 
closer scrutiny it is evident that most neuropsycho- 
logical tests fit within a simple, logical model of 
brain-behavior relationships. We will use this 
model as a framework for discussing well-known 
neuropsychological tests and procedures. 
Neuropsychological assessment involves more 
than the administration and scoring of specialized 
tests. An important component of any assessment 
is an evaluation of the client’s mental status. This 
is particularly true with elderly clients who may be
experiencing Alzheimer’s disease or other forms of 
dementia. Accordingly, we close this chapter with 
an emphasis upon mental status assessment in the 
elderly. 


A CONCEPTUAL MODEL OF
BRAIN-BEHAVIOR RELATIONSHIPS


Bennett (1988) has proposed a simplified model 
of brain-behavior relationships that is helpful in 
organizing the seemingly chaotic profusion of 
neuropsychological tests (Figure 9.6). His con- 
ceptualization is a slight expansion of the model 
presented by Reitan and Wolfson (1993). Accord- 
ing to this view, each neuropsychological test or 
procedure evaluates one or more of the following 
categories: 





1. Sensory input
2. Attention and concentration
3. Learning and memory
4. Language
5. Spatial and manipulatory ability
6. Executive functions:
   Logical analysis
   Concept formation
   Reasoning
   Planning
   Flexibility of thinking
7. Motor output


The order of the categories listed corresponds 
roughly to the order in which incoming informa- 
tion is analyzed by the brain in preparation for a re- 
sponse or motor output. 

In the remainder of this topic, the discussion 
of neuropsychological tests and procedures is or- 
ganized around these seven categories. Within 
each category we will review established tests and 
also introduce new instruments that show promise 
of extending the horizons of neuropsychological 
assessment. However, the reader needs to know 
that neuropsychological assessment commonly 
involves a battery of tests. One approach is flexi- 
ble or patient-centered testing in which an indi- 
vidualized test battery is fashioned for each client. 
These batteries are based upon the presenting 
complaints, referral issues, and an initial assess- 
ment (Goodglass, 1986; Kane, 1991). More typi- 
cally, neuropsychologists employ a fixed battery 
of tests for most referrals. One of the most widely 
used fixed batteries, the Halstead-Reitan Neu- 
ropsychological Battery, is outlined in Table 9.7. 
The chapter closes with an illustration of how 
another well-known fixed battery, the Luria- 
Nebraska Neuropsychological Battery, is used in 
assessment. 


ASSESSMENT OF SENSORY INPUT


The accuracy of sensory input is crucial to the pro- 
ficiency of perception, thought, plans, and action. 
An individual who does not see stimuli correctly, 
hear sounds accurately, or process touch reliably 
may encounter additional handicaps at higher lev- 
els of perception and cognition. Neuropsychologi- 
cal assessment always incorporates a multimodal 
examination of sensory capacities. 


FIGURE 9.6 Conceptual Model of Brain-Behavior Relationships
[Figure labels: Executive Functions (Logical Analysis, Concept Formation, Reasoning, Planning, Flexibility of Thinking); Visuospatial, Visuoconstructive, and Manipulospatial Skills; Language Skills; Attention and Concentration; Sensory Input]

Source: Based on Reitan and Wolfson (1993) and reprinted with
permission from Bennett, T. (1988). Use of the Halstead-Reitan
Neuropsychological Test Battery in the assessment of head injury.
Cognitive Rehabilitation, 6, 18-25.





Sensory-Perceptual Exam 


The procedures developed by Reitan and Klove 
are entirely typical of sensory-perceptual procedures 
TABLE 9.7 Tests and Procedures of the Halstead-Reitan Test Battery

Test                              Description

Category Test*                    Measures abstract reasoning and concept formation;
                                  requires examinee to find the rule for categorizing
                                  pictures of geometric shapes

Tactual Performance Test*         Measures kinesthetic and sensorimotor ability;
                                  requires blindfolded examinee to place blocks in
                                  appropriate cutout on an upright board with dominant
                                  hand, then nondominant hand, then both hands; also
                                  tests for incidental memory of blocks

Speech Sounds Perception Test*    Measures attention and auditory-visual synthesis;
                                  requires examinee to pick from four choices the
                                  written version of taped nonsense words

Seashore Rhythm Test*             Measures attention and auditory perception; requires
                                  examinee to indicate whether paired musical rhythms
                                  are same or different

Finger-Tapping Test*              Measures motor speed; requires examinee to tap a
                                  telegraph keylike lever as quickly as possible for
                                  10 seconds

Grip Strength                     Measures grip strength with dynamometer; requires
                                  examinee to squeeze as hard as possible; separate
                                  trials with each hand

Trail Making, parts A, B          Measures scanning ability, mental flexibility, and
                                  speed; requires examinee to connect numbers
                                  (part A) or numbers and letters in alternating order
                                  (part B) with a pencil line under pressure of time

Tactile Form Recognition          Measures sensory-perceptual ability; requires
                                  examinee to recognize simple shapes (e.g., triangle)
                                  placed in the palm of the hand

Sensory-Perceptual Exam           Measures sensory-perceptual ability; requires
                                  examinee to respond to simple bilateral sensory
                                  tasks, e.g., detecting which finger has been touched,
                                  which ear has received a brief sound; assesses the
                                  visual fields

Aphasia Screening Test            Measures expressive and receptive language abilities;
                                  tasks include naming a pictured item (e.g., fork),
                                  repeating short phrases; copying tasks (not a measure
                                  of aphasia) included here for historical reasons

Supplementary                     WAIS-III, WRAT-3, MMPI-2, memory tests such as
                                  Wechsler Memory Scale-III or Rey Auditory Verbal
                                  Learning Test

*Strictly speaking, these five measures constitute the Halstead-Reitan Test Battery. However, in common
parlance reference to the Halstead-Reitan includes all of the measures listed in the table.


(Reitan, 1984, 1985). The Reitan-Klove Sensory- 
Perceptual Examination consists of several meth- 
ods for delivering unilateral and bilateral 
stimulation in the modalities of touch, hearing, and 


vision. The tasks are so simple that normal persons 
seldom make any errors at all. For example, the ex- 
aminee is asked to say which hand has been 
touched (with eyes closed), or to report which ear 
has received a barely audible finger snap, or to
identify which number has been traced on the fin- 
gertip. The results of this test are especially diag- 
nostic if the examinee consistently makes more 
errors on one side of the body than the other. The 
reader will recall from the previous chapter that 
neural innervation is almost exclusively opposite- 
sided. Furthermore, certain areas of the cerebral 
cortex are devoted to primary processing of touch, 
hearing, and vision. Thus, an examinee who finds 
it difficult to process touch in the right hand may 
have a lesion in the postcentral gyrus of the left 
parietal lobe. Similarly, difficulty processing sound 
in the right ear may indicate a lesion in the superior 
portion of the left temporal lobe, and right-sided vi- 
sual defects may indicate brain impairment in the 
left occipital lobe. 
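
The opposite-sided reasoning in the preceding examples amounts to a simple lookup. The sketch below is only an illustration of that inference, not a diagnostic rule: given the sensory modality and the side of the body showing consistent errors, it returns the opposite hemisphere and the primary processing region named in the text. The names `PRIMARY_REGIONS` and `suspected_site` are hypothetical.

```python
# Primary processing regions for each modality, as described in the text.
PRIMARY_REGIONS = {
    "touch": "postcentral gyrus of the parietal lobe",
    "hearing": "superior portion of the temporal lobe",
    "vision": "occipital lobe",
}

# Neural innervation is almost exclusively opposite-sided.
OPPOSITE_SIDE = {"left": "right", "right": "left"}


def suspected_site(modality, impaired_side):
    """Return (hemisphere, region) suggested by a one-sided sensory deficit."""
    return OPPOSITE_SIDE[impaired_side], PRIMARY_REGIONS[modality]


hemisphere, region = suspected_site("touch", "right")
print(f"Possible lesion: {region}, {hemisphere} hemisphere")
```

In practice such a lookup only generates a hypothesis; the examiner weighs it against the rest of the test battery and the medical findings.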


Finger Localization Test 


Finger localization is a venerable procedure devel- 
oped by neurologists to evaluate possible sensory 
losses caused by impairment of brain functions. 
Most neuropsychological test batteries employ a 
variant of this test, in which examinees must iden- 
tify those fingers that have been touched (without 
benefit of sight). Benton has developed a well- 
normed 60-item test of finger localization that con- 
sists of three parts: (1) with the hand visible, 
identifying single fingers touched by the examiner 
with the pointed end of a pencil (10 trials each 
hand); (2) with the hand hidden from view, identi- 
fying single fingers touched by the examiner (10 tri- 
als each hand); (3) with the hand hidden from view, 
identifying pairs of fingers simultaneously touched 
by the examiner (10 trials each hand). The method 
of response is left to the patient: naming, touching, 
or pointing to fingers on a diagram (Benton, Sivan, 
Hamsher, Varney, & Spreen, 1994). Each stimulus 
presentation is scored right or wrong, and normal 
adults typically make very few errors in the 60 tri- 
als. Mean scores for normal adults are near perfect, 
ranging from 56 to 60 in various samples. In con- 
trast, patients with brain disease find finger local- 
ization to be a challenging task, particularly on the 
second and third parts of the test. 


MEASURES OF ATTENTION 
AND CONCENTRATION 


The attentional capacity of the brain makes it pos- 
sible to attend to meaningful stimuli, screen irrele- 
vant sensory input from the profusion of incoming 
stimuli, and flexibly shift to alternative stimuli 
when conditions demand it (Kinsbourne, 1994).
While in theory it might be possible to make sub- 
tle distinctions between simple attention, concen- 
tration, mental shifting, mental tracking, vigilance, 
and other variants of attention/concentration, in 
practice these skills are difficult to separate. Only 
one attentional measure—the Test of Everyday At- 
tention (TEA)—has succeeded in partitioning at- 
tention into its component sources. We discuss the 
TEA and other prominent measures of attentional 
impairment in the following sections. 


Test of Everyday Attention 


The Test of Everyday Attention (TEA) is a promis- 
ing measure devised in Great Britain by Robertson, 
Ward, Ridgeway, and Nimmo-Smith (1994, 1996).
The TEA measures the subcomponents of attention, 
including sustained attention, selective attention, 
divided attention, and attentional switching. The 
subtests of the TEA are outlined in Table 9.8. The 
test has three parallel versions and has been well 
validated with closed head injury clients, stroke pa- 
tients, and persons with Alzheimer’s disease. Nor- 
mative data are based upon the performance of 154 
healthy individuals between the ages of 18 and 80. 
Examinees enjoy the real-life scenarios of the TEA, 
which adds to the ecological validity of the instru- 
ment. The TEA is highly sensitive to normal age ef- 
fects in the general population and is therefore well 
suited to geriatric assessment. With the exception 
of the Elevator Counting subtest, the eight subtests 
were standardized to yield equivalent scores with a 
common mean of 10 and standard deviation of 3. 
Thus, the TEA allows for subtest analysis as a 
means of identifying an individual’s particular 
strengths and weaknesses (Crawford, Sommerville, 
& Robertson, 1997). The TEA is highly sensitive to 
the effects of closed head injury (Chan, 2000), with 
TABLE 9.8 Subtests of the Test of Everyday Attention (TEA)

Map Search: A two-minute speeded search for 80 symbols
on a colored map; measures selective attention.

Elevator Counting: Simulation of elevator floor counting
from tape-presented tones; measures sustained
attention.

Elevator Counting with Distraction: Same as above but
with auditory distractors; measures sustained
attention.

Visual Elevator: Visual simulation of elevator floor
counting with up-down reversals; measures attentional
switching.

Auditory Elevator with Reversal: Same as visual elevator,
except it is presented on tape; measures attentional
switching.

Telephone Search: Search for key symbols while
searching entries in a simulated classified telephone
directory; measures divided attention.

Telephone Search Dual Task: Combines Telephone
Search with simultaneous counting of auditory tones;
measures divided attention.

Lottery: Subject listens for winning numbers known to
end in 55 and then writes down preceding stimuli;
measures sustained attention.





the Map Search and Telephone Search subtests re- 
vealing the largest deficits from brain injury (Bate, 
Mathias, & Crawford, 2001). 


Continuous Performance Test 


The Continuous Performance Test (CPT) is not re- 
ally a single test but rather a family of similar pro- 
cedures that dates back to the pathbreaking 
research of Rosvold, Mirsky, Sarason, and others 
(1956). These authors devised a measure of sus- 
tained attention (also called vigilance) that involved 
continuous presentation of letters on a screen. In 
some cases, examinees were to press a key when a 
certain letter appeared (e.g., x). In other instances, 
examinees were to press a key when a certain let- 
ter appeared after another letter (e.g., x when it oc- 
curs after a). Errors of omission are noted when the 
examinee fails to press for a target stimulus. Errors
of commission are noted when the examinee
presses the key for a nontarget stimulus. Normal
subjects make few errors.
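
The two error types can be tallied mechanically. The following is a minimal sketch, not the scoring routine of any published CPT: the function name `score_cpt`, its parameters, and the sample letter stream are all assumptions made for illustration. Passing a `cue` switches from the simple version (press for every x) to the sequential version (press for x only after a).

```python
def score_cpt(stimuli, responses, target="x", cue=None):
    """Count omission and commission errors for one CPT run.

    stimuli   -- letters in the order they were shown
    responses -- booleans, True where the examinee pressed the key
    target    -- letter that calls for a key press
    cue       -- if given, the target counts only when it follows this letter
    """
    if len(stimuli) != len(responses):
        raise ValueError("stimuli and responses must have the same length")
    omissions = 0
    commissions = 0
    previous = None
    for letter, pressed in zip(stimuli, responses):
        is_target = letter == target and (cue is None or previous == cue)
        if is_target and not pressed:
            omissions += 1      # failed to press for a target stimulus
        elif pressed and not is_target:
            commissions += 1    # pressed for a nontarget stimulus
        previous = letter
    return {"omissions": omissions, "commissions": commissions}


# Hypothetical six-letter run; a real administration presents far more stimuli.
presses = [False, True, False, False, True, True]
simple_version = score_cpt("axbxax", presses)
sequential_version = score_cpt("axbxax", presses, cue="a")
```

The same key presses can yield different error counts under the two rules, which is one reason scores from different CPT versions are not interchangeable.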

Although CPT tests are sensitive to a wide va- 
riety of brain-impairing conditions including hy- 
peractivity, drug effects, schizophrenia, and overt 
brain damage, these tests are not a panacea for the 
diagnosis of attention-deficit disorders. For exam- 
ple, in one study of the popular Conners (1995) 
CPT, children with diagnosed Attention-Deficit/ 
Hyperactivity Disorder (ADHD) did not score 
worse than clinical controls; on the other hand, 
children with diagnosed reading disorders showed 
impaired performance on the CPT (McGee, Clark, 
& Symons, 2000). In general, reviewers recom- 
mend that CPT tests should be interpreted in the 
context of a comprehensive test battery, especially 
when they are used in the assessment of persons 
with suspected attentional problems (Riccio, 
Reynolds, & Lowe, 2001). 

The CPT is ideal for computerized adaptation, 
and dozens of different versions of it have appeared 
in the literature (e.g., Conners, 1995; Gordon & 
Mettelman, 1988). Unfortunately, the proliferation 
of similar but not identical tests has hindered re- 
search on the practical utility of this promising 
measure of attention. Recently, Sandford and 
Turner (1997) have published a computerized CPT 
that uses both visual and auditory stimuli. The In- 
termediate Visual and Auditory Continuous Perfor- 
mance Test (IVA) is normed on 781 normal persons 
ranging from 5 to 90 years of age and screened for 
attention deficit, learning difficulties, emotional 
problems, and medication use. In one analysis, the 
IVA showed 92 percent sensitivity (i.e., an 8 per- 
cent rate of false negatives) and 90 percent speci- 
ficity (i.e., a 10 percent rate of false positives) in 
differentiating children diagnosed with Attention- 
Deficit/Hyperactivity Disorder (ADHD) from nor- 
mal children. This instrument is just one of many 
promising neuropsychological tests that take advantage
of microcomputer technology.
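
The relationship between sensitivity, specificity, and the two error rates quoted above is simple arithmetic. The sketch below uses hypothetical counts (100 diagnosed and 100 normal children) chosen only so that the proportions match the percentages in the text; the actual group sizes in the IVA analysis are not given here.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) as proportions between 0 and 1."""
    diagnosed = true_pos + false_neg     # everyone who has the condition
    unaffected = true_neg + false_pos    # everyone who does not
    if diagnosed == 0 or unaffected == 0:
        raise ValueError("each group needs at least one case")
    return true_pos / diagnosed, true_neg / unaffected


# Hypothetical counts, not data from the IVA validation study.
sens, spec = sensitivity_specificity(true_pos=92, false_neg=8,
                                     true_neg=90, false_pos=10)
false_negative_rate = 1 - sens   # complement of sensitivity
false_positive_rate = 1 - spec   # complement of specificity
```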


Paced Auditory Serial Addition Task 


Considering its utter simplicity, the Paced Auditory 
Serial Addition Task (PASAT) is an extremely 
sensitive index of mental tracking (Gronwall &
Wrightson, 1974; Gronwall & Sampson, 1974). 
The examinee listens to a series of digits presented 
by audiotape and adds together each successive pair 
of digits. Thus, if the numbers presented are “3-1- 
9-5-4,” the examinee should respond “4-10-14-9.” 

The PASAT begins with a 10-digit practice se- 
ries, with a new digit presented every 2.4 seconds. 
The actual test consists of 61 stimuli (hence re- 
quiring 60 additions) at each of four presentation 
speeds: 2.4, 2.0, 1.6, and 1.2 seconds between dig- 
its. By computing the percent correct at each of the 
presentation rates, the examiner obtains four scores 
on the PASAT. 
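
The scoring rule just described can be sketched in a few lines. The function names below are assumptions for illustration, and the sketch ignores timing: it simply pairs each successive digit with its predecessor and computes the percent correct for one presentation rate.

```python
def pasat_answers(digits):
    """Return the correct response for each successive pair of digits."""
    return [first + second for first, second in zip(digits, digits[1:])]


def pasat_percent_correct(digits, responses):
    """Percent of additions answered correctly at one presentation rate.

    responses holds one entry per addition; use None for no response.
    """
    expected = pasat_answers(digits)
    if len(responses) != len(expected):
        raise ValueError("need exactly one response per addition")
    correct = sum(1 for want, got in zip(expected, responses) if want == got)
    return 100.0 * correct / len(expected)


# The series from the text: "3-1-9-5-4" calls for "4-10-14-9".
answers = pasat_answers([3, 1, 9, 5, 4])
```

A full administration would call `pasat_percent_correct` four times, once for each presentation speed, with 61 digits and 60 responses per call.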

Even though the format of the PASAT is sim- 
ple, the information-processing demands of this 
test are quite burdensome. In order to perform well, 
the examinee must hold two numbers in short-term 
memory, perform a mental addition, speak the an- 
swer, retain in short-term memory only the last of 
the two numbers, annex the latest digit to short- 
term memory, and then start the cycle over again. 
Persons with impaired brain functions find this 
mental juggling to be cognitively overwhelming. 

Gronwall recommends the PASAT for serial 
testing of concussion patients (Gronwall, 1977; 
Gronwall & Wrightson, 1981). Briefly defined, a 
concussion is a transitory alteration of conscious- 
ness from a blow to the head. A concussion may be 
followed by temporary amnesia, dizziness, nausea, 
weak pulse, and slow respiration, yet there is no 
demonstrable organic brain damage (McMordie, 
1988). It is widely recognized that the PASAT is 
very sensitive to the effects of concussion (Stuss, 
Stethem, Hugenholtz, & Richard, 1989). However, 
a crucial issue in concussion is how long the patient 
should recuperate. When successive PASAT scores 
finally return to the normal range—which might 
take several days or several weeks—the therapist 
can have increased confidence that the patient is 
ready to return to work. 

Several versions of the PASAT are available, 
including some that are audiotapes of spoken num- 
bers, and others that are computer-generated pre- 
sentations with precise control of timing (e.g., 
Wingenfeld, Holdwick, Davis, & Hunter, 1999). Of
course, norms for the PASAT are specific to each
version (e.g., Wiens, Fuller, & Crossen, 1997).

Largely on humanistic grounds, Lezak (1995) 
warns against routine use of the PASAT. In her ex- 
perience, even cognitively intact persons experi- 
ence the test as very stressful, feeling that they are 
failing even when their performance is normal. In- 
sofar as there are easier ways to demonstrate atten- 
tional impairment, she recommends use of the 
PASAT only in special circumstances: 


I keep it available for those times when subtle at- 
tentional deficits need to be made obvious to the 
most hide-bound skeptics for some purpose very 
much in the patient’s interest; and then I prepare 
these patients beforehand, letting them know that it 
can be an unpleasant procedure and that they may 
feel that they are failing when they are not. 


In a field in which testing is justified mainly on the 
basis of hit rates and the like, her patient-centered 
perspective is refreshing and welcome.


Subtracting Serial Sevens 


A well-known task frequently included in a mental 
status examination is subtracting serial sevens 
(Strub & Black, 1985). The examinee is told to 
“subtract seven from 100.” When this is completed, 
the examinee is then told “Now subtract 7 from 93 
and keep on subtracting sevens until you can’t go 
any further.” The examiner records the number of 
individually incorrect subtractions and may also 
score for time taken and pauses longer than five 
seconds. 
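
One way to count "individually incorrect" subtractions is to judge each response against the examinee's own previous answer, so that a single slip is not penalized again on every later step. That reading is an assumption of this sketch, as is the function name; timing and pauses are not modeled.

```python
def serial_sevens_errors(responses, start=100, step=7):
    """Count responses that are not exactly `step` below the preceding value."""
    errors = 0
    previous = start
    for answer in responses:
        if answer != previous - step:
            errors += 1
        # The next subtraction is judged from what the examinee actually said.
        previous = answer
    return errors


# A flawless run from 100 produces 93, 86, ... down to 2 (14 subtractions).
perfect_run = list(range(93, 0, -7))
```

For example, the responses 93, 86, 78, 71, 64 contain one incorrect subtraction (86 to 78); the later answers are each seven below the preceding response and so are not counted as further errors.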

Smith (1967) has reported one of the few nor- 
mative studies of subtracting serial sevens. He 
tested 132 employed adults, most with college or 
professional degrees. Only 2 percent of his sample 
was unable to complete the test, while another 5 
percent made more than five errors. Women were 
more error-prone than men, particularly women 
over 45 with no college education. Thus, examiners 
must not overinterpret minor problems with sub- 
tracting serial sevens. On the other hand, grossly de- 
fective performance—an inability to proceed, very 
high error rate, or very slow subtractions—is char- 
acteristic of individuals with brain impairment. 
Additional Measures of
Attentional Impairment 


Many tests devised for other purposes possess a 
strong attentional factor. The Digit Span and Arith- 
metic subtests of the Wechsler intelligence scales 
are recognized as good indices of immediate audi- 
tory attention. The Coding or Digit Symbol subtests 
also load heavily on a freedom-from-distractibility 
factor. 

Smith (1968, 1973) has devised an interesting 
extension of Wechsler’s Digit Symbol, known as
the Symbol Digit Modalities Test (SDMT). In this 
test, the symbols are printed on the page and the ex- 
aminee writes corresponding numbers underneath 
(Figure 9.7). With this format the examiner can ad- 
minister both a written and an oral trial, which 
helps isolate the source of difficulty with the sub- 
stitution task. For example, an examinee who was 
normatively impaired on the written portion but 
above average on the oral portion might suffer from 
an impairment of motor control. The correlation be- 
tween SDMT and Digit Symbol scores is extremely 
high (r = .91), but the SDMT produces scores that 
are comparatively lower than Digit Symbol (Mor- 
gan & Wheelock, 1992). 

Several tests from the Halstead-Reitan battery 
are good measures of attention (Bennett, 1988). In 
the Speech Sounds Perception Test (SSPT), the examinee
must pick from four choices the written version of
taped nonsense words. For example, the voice on
the tape might say “freep” while the examinee must
read from four choices and underline the correct
alternative:


freeb fleeb freep fleep 


The SSPT is highly sensitive to attentional impair- 
ments from any kind of brain damage. The Sea- 
shore Rhythm Test, originally a test of musical 
aptitude in which paired musical rhythms must be 
compared, also turns out to be highly dependent 
upon attentional processes. The Trail Making Test, 
parts A and B, is also sensitive to attentional im- 
pairment. Shum, McFarland, and Bain (1990) dis- 
cuss additional tests of attention. 


TESTS OF LEARNING AND MEMORY


Learning and memory are intertwined processes that 
are difficult to discuss in isolation. Learning new 
material usually requires the exercise of memory. 
Furthermore, many tests of memory incorporate a 
learning curve through repeated administrations. 
The separation of learning and memory processes is 
theoretically possible, but of little practical value in 
clinical assessment. We make no tight distinction 
between these processes. 

Memory tests can be categorized according to 
several dimensions, including short-term versus 





[Figure: a key pairing nine symbols with the digits 1 through 9, followed by a row of symbols to be coded.]






FIGURE 9.7 The Symbol Digit Modalities Test (SDMT) 


Source: Reprinted with permission from Smith, A. (1973). Symbol Digit Modalities Test Manual. Los Angeles: Western Psychological 
Services. Material from the Symbol Digit Modalities Test copyright © 1973 by Western Psychological Services. Reprinted by permission 
of the publisher, Western Psychological Services, 12031 Wilshire Boulevard, Los Angeles, California 90025, United States of America. 
long-term, verbal versus pictorial, and learning
curve versus no learning curve. These dimensions 
reflect neurological factors discussed in the previ- 
ous section. For example, verbal memory is signif- 
icantly lateralized to the left hemisphere, whereas 
pictorial memory is largely underwritten by the 
right hemisphere. The interested reader can consult 
Lezak (1995) and Reeves and Wedding (1994) for 
more detailed analyses of the neural substrates for 
different types of memory. Here, we will concen- 
trate on the psychometric characteristics of four 
quite dissimilar memory tests. 


Wechsler Memory Scale-III


The Wechsler Memory Scale-III (Tulsky, Zhu, & 
Ledbetter, 1997) is a substantial revision of a simple
one-page test published more than 50 years
ago (Wechsler, 1945). The third edition is an ex- 
tensive, multiphasic test of memory consisting of 
17 subtests, including 7 that are optional. The 10 
primary subtests are described in Table 9.9. These 
subtests constitute the basis for obtained age-ad- 
justed scaled scores (mean of 100 and SD of 15) for 
eight primary indices of memory: 


Auditory immediate        Auditory delayed
Visual immediate          Visual delayed
Immediate memory          Auditory recognition delayed
General memory            Working memory


The WMS-III was co-normed with the WAIS- 
III in 1997. The standardization of the new instru- 
ment is superb, with 200 cases selected for each of 
these age bands: 16-17, 18-19, 20-24, 25-29, 
30-34, 35-44, 45-54, 55-64, 65-69, 70-74, 
75-79. For the two oldest age bands (80-84, 
85-89), 150 cases and 100 cases, respectively, were 
included. Based upon 1995 census data, partici- 
pants for the standardization sample were carefully 
stratified as to age, sex, race/ethnicity, educational 
level, and geographic region. 

Validity studies of the WMS-III are strongly
positive, although factor-analytic investigation 
does not always support the designated breakdown 
into the various aspects of memory previously 


TABLE 9.9 Wechsler Memory Scale-III Primary 
Subtests 





Immediate Recall Subtests 


Logical Memory I: Recall of essential elements from 
brief stories read to the examinee. 

Faces I: Yes-no recall for 24 target faces each presented 
for two seconds. 

Verbal Paired Associates I: Recall for a list of eight 
paired terms (e.g., truck-arrow) when only the first
term is presented (e.g., truck-?). 

Family Pictures I: Recall of location and activities of 
persons depicted in pictures of family scenes. 


Letter-Number Sequencing: Reordering of random dig- 
its and letters so that numbers and letters are in correct 
order (e.g., “7, x, d, s, 4, 2” is reordered as “2, 4, 7, d, s, x”).

Spatial Span: A visual analogue to Digit Span in which 
numbered blocks are tapped in a particular order; the ex- 
aminee completes both a forward and a backward series. 


Delayed Recall Subtests* 


Logical Memory II 

Faces II 

Verbal Paired Associates II 
Family Pictures II 





*30-minute delayed recall for stimuli in administration I.


cited. The most powerful evidence for validity is 
that the instrument functions well in the detection 
of memory deficits. In the initial validation studies
(Tulsky et al., 1997), it was observed that clinical 
groups with neurological disorders (e.g., Alz- 
heimer’s disease, traumatic brain injury) scored 
significantly low on all eight of the WMS-III pri- 
mary indices. For example, a sample of 35 individ- 
uals with probable early stage Alzheimer’s disease 
obtained average scores in the mid- to high 60s on 
six of the eight indices. This is especially notewor- 
thy because memory deficit is the initial complaint 
in the progression of Alzheimer’s disease. 
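
To put index scores in the 60s in perspective, they can be converted to standard deviation units using the scale's mean of 100 and SD of 15. The sketch below also estimates a percentile rank, which assumes that index scores are normally distributed in the standardization sample; that assumption, the function name, and the illustrative score of 65 are not taken from the test manual.

```python
import math


def index_to_z_and_percentile(index_score, mean=100.0, sd=15.0):
    """Convert an index score to a z score and a normal-curve percentile."""
    z = (index_score - mean) / sd
    # Standard normal cumulative distribution, written with the error function.
    percentile = 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, percentile


# Illustrative value only: a score of 65 stands in for "the mid-60s".
z_score, percentile_rank = index_to_z_and_percentile(65)
```

Under these assumptions a score of 65 lies more than two standard deviations below the mean, which places it at roughly the first percentile.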
Validity research with the WMS-III is highly 
promising. For example, in a study of patients with 
different levels of traumatic brain injury (TBI), the
WMS-III performed better than the WAIS-III in
identifying patients with mild TBI (Fisher, Ledbetter, 
Cohen, Marmor, & Tulsky, 2000). This is impor-
tant because it demonstrates that the WMS-III taps 
relevant aspects of memory (known to be impaired 
in mild TBI) and is not just a proxy measure of in- 
telligence. Furthermore, the WMS-III retains the 
essential features of its predecessor, the WMS-R, 
for which a large body of validity research is avail- 
able. For example, Ryan and Lewis (1988) found 
substantial memory deficits in recently detoxified 
alcoholics. This is an important finding because
clinical studies with the original WMS did not re- 
veal memory deficits in alcoholics, which caused 
researchers to doubt the validity of the first edition. 
The WMS-R also functions well in the identifica- 
tion of neuropsychological deficits caused by 
closed head injury (Reid & Kelly, 1993). In a re- 
lated finding, Mittenberg, Azrin, Millsaps, and 
Heilbronner (1993) found that individuals who at- 
tempt to malinger head injury symptoms on the 
WMS-R produce a pattern of scores that can be dis- 
criminated from true cases of head injury. This is 
an important conclusion, because the accuracy of 
test results is usually contested when head-injured 
persons pursue litigation or worker’s compensa- 
tion. The WMS-R also reveals the expected mem- 
ory deficits in patients with schizophrenia, which 
supports the validity of the test (Gold, Randolph, 
Carpenter, and others, 1992). 


Rey Auditory Verbal Learning Test 


In the early 1900s, the Swiss psychologist Edouard 
Claparede (1873-1940) proposed a memory test 
consisting of the free-recall of a 15-item word list. 
This test evolved into the Rey Auditory Verbal 
Learning Test (RAVLT), making it one of the old- 
est mental tests in continuous use (Boake, 2002). 
The test first appeared in French (Rey, 1964), but 
an English-language adaptation has been provided 
by Lezak (1983, 1995) and others. The RAVLT is 
a very popular test of memory, especially for purposes
of clinical research. A search of PsycINFO
from 1950 onward revealed more than 400 pub- 
lished articles using this simple instrument. 

In administering the RAVLT, the examiner 
reads a list of 15 concrete nouns at the rate of one 
per second. The examinee recalls as many as pos- 


sible in any order. Forewarning the examinee to re- 
call all the words, including those previously re- 
called, the examiner reads the entire list a second 
time. A third, fourth, and fifth administration and 
recall then ensue; these are followed by an inter- 
ference trial with a new list of words. Next, imme- 
diate recall of the original list is tested (without 
benefit of a new presentation). Finally, a recogni- 
tion trial is included in which the examinee must 
underline the administered words from a longer 
written paragraph. The test yields a number of
scores, including the number recalled (of 15) for 
each of the initial five trials, the total for the five 
trials (75 possible), the immediate recall after the 
distractor list is read, and the recognition score. 
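
The scores listed above can be assembled from a simple record of the session. This is a sketch under stated assumptions rather than a published scoring program: the function and key names are invented for illustration, and `learning_gain` (trial 5 minus trial 1) is a hypothetical summary of improvement across trials, not an official RAVLT index.

```python
def ravlt_scores(trial_recalls, post_interference_recall, recognition_hits,
                 list_length=15):
    """Collect the basic RAVLT scores from one administration.

    trial_recalls            -- words recalled on each of the five learning trials
    post_interference_recall -- words recalled after the distractor list
    recognition_hits         -- list words correctly underlined
    """
    if len(trial_recalls) != 5:
        raise ValueError("the RAVLT has five learning trials")
    counts = list(trial_recalls) + [post_interference_recall, recognition_hits]
    if any(count < 0 or count > list_length for count in counts):
        raise ValueError("each score must lie between 0 and the list length")
    return {
        "trials": list(trial_recalls),
        "total": sum(trial_recalls),                 # 75 possible
        "max_total": 5 * list_length,
        # Hypothetical index of improvement across the learning trials.
        "learning_gain": trial_recalls[-1] - trial_recalls[0],
        "post_interference": post_interference_recall,
        "recognition": recognition_hits,
    }


# Hypothetical protocol for illustration.
example = ravlt_scores([5, 7, 9, 11, 12], post_interference_recall=10,
                       recognition_hits=14)
```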

Rosenberg, Ryan, and Prifitera (1984) con- 
cluded that the RAVLT performs well in the iden- 
tification of patients known to be memory impaired 
by other criteria. In addition to an overall reduction
in performance, memory-impaired patients showed 
a reduced rate of improvement across the five learn- 
ing trials. Adult norms for the RAVLT can be found 
in Geffen, Moar, O’Hanlon, Clark, and Geffen 
(1990) and Wiens, McMinn, and Crossen (1988). 
Norms for children ages 5 to 16 are provided by 
Bishop, Knights, and Stoddart (1990). Ivnik, 
Malec, Smith, and others (1992) contributed age- 
specific norms based on 530 cognitively normal 
persons 56 to 97 years of age. Schmidt (1996) has 
compiled, summarized, and synthesized available 
norms for the RAVLT. 


Fuld Object-Memory Evaluation 


The Fuld Object-Memory Evaluation is a useful test 
of memory impairment in the elderly (Fuld, 1977). 
The test begins by presenting the examinee a bag 
containing 10 common objects (ball, bottle, button, 
etc.). The task is not described as a memory test. 
The examinee is asked to determine whether he/she 
can identify objects by touch alone. Each object is 
felt and then named; the examinee then pulls it out 
of the bag to see if he or she was right. After all 10 
items have been correctly identified, a distractor 
task is administered: rapidly naming words in a se- 
mantic category (e.g., names, foods, things that 
make people happy, vegetables, or things that make 
people sad). Then, the examinee is asked to recall
as many of the objects as possible. After each re- 
call, the subject is slowly and clearly reminded ver- 
bally of each item omitted on that trial, a procedure 
called selective reminding (Buschke & Fuld, 1974). 
The examinee is then administered four more 
chances to recall the list by selective reminding, 
with a distractor task after each trial. Delayed recall 
is tested after a 5-minute interval. Finally, the test 
closes with a multiple-choice recognition test. 
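
The logic of selective reminding is that, after each recall trial, the examinee is reminded only of the items just omitted. A minimal sketch follows; the function names are assumptions, and the three-object list is shortened from the actual 10 objects purely for illustration.

```python
def items_to_remind(items, recalled):
    """Return the list items omitted on this trial, in presentation order."""
    recalled_set = set(recalled)
    return [item for item in items if item not in recalled_set]


def summarize_trials(items, recalls_by_trial):
    """For each recall trial, report how many items were recalled and
    which items the examiner would then remind the examinee of."""
    summary = []
    for recalled in recalls_by_trial:
        reminders = items_to_remind(items, recalled)
        summary.append({
            "recalled": len(items) - len(reminders),
            "reminded": reminders,
        })
    return summary


# Shortened, hypothetical session using three of the objects named in the text.
objects = ["ball", "bottle", "button"]
session = summarize_trials(objects, [["ball"],
                                     ["button", "ball"],
                                     ["ball", "bottle", "button"]])
```

A shrinking `reminded` list across trials reflects benefit from the reminders; a list that stays about the same length corresponds to the pattern the text describes in memory-impaired elderly persons.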

The Fuld test is often used to help confirm a 
diagnosis of Alzheimer’s disease, a degenerative 
neurological disorder described in the previous 
topic. In the early stages of Alzheimer’s disease the 
most prominent symptom is memory loss. Elderly 
persons with memory impairment not only score 
lower than control subjects on the Fuld Object- 
Memory Evaluation, they also benefit very little 
from the selective reminding. Fuld (1977) has pro- 
vided norms for community-active and healthy 
nursing-home residents in their 70s and 80s. Fuld, 
Masur, Blau, Crystal, and Aronson (1990) describe 
a prospective study in which the Fuld Object-Mem- 
ory Evaluation demonstrated promise as a predic- 
tor of dementia in cognitively normal elderly. 
Lichtenberg, Manning, Vangel, and Ross (1995) 
describe a program of neuropsychological re- 
search using the Fuld test with older urban medical 
patients. 


Additional Tests of Learning and Memory 


Because of space limitations, we can do no more 
than briefly mention several other useful tests of 
learning and memory. The California Verbal 
Learning Test-II is patterned after the Rey AVLT 
but provides software to quantify and analyze the 
pattern of results (Delis, Kramer, Kaplan, & Ober, 
2000). The Benton Visual Retention Test is a 
design-copying test of visual memory (Sivan, 
1991). The Rivermead Behavioral Memory Test is 
a measure of everyday memory (e.g., route find- 
ing, remembering a name) in rehabilitation set- 
tings (Koltai, Bowler, & Shore, 1996). Good 
reviews of memory tests can be found in Lezak 
(1995), Reeves and Wedding (1994), and Spreen 
and Strauss (1998). 





ASSESSMENT OF
LANGUAGE FUNCTIONS


As noted in a previous section, language function- 
ing offers a window to the integrity of the left cere- 
bral hemisphere. Thus, neuropsychologists are 
keenly interested in an examinee’s ability to speak, 
read, write, and comprehend what others say. Lit- 
tle wonder that a comprehensive neuropsychologi- 
cal examination always includes one or more 
methods for assessing language functions. 

Neuropsychologists exhibit a special interest in a 
variety of language dysfunctions known collectively 
as aphasia. Briefly stated, aphasia is any deviation 
in language performance caused by brain damage. In 
testing for aphasia, a neuropsychologist might use 
any or all of three approaches: (1) a nonstandardized 
clinical examination, (2) a standardized screening 
test, or (3) a comprehensive diagnostic test of aphasia.
We will provide examples of each in our brief
review of assessment methods in aphasia.


Clinical Examination for Aphasia 


A clinical examination for aphasia has the advan- 
tages of simplicity, flexibility, and brevity. These 
are important attributes when assessing a severely 
impaired patient who may require bedside testing. 
Every practitioner has a slightly different version 
of the brief clinical exam (Lezak, 1995; Reitan, 
1984, 1985). Nonetheless, certain elements com- 
monly are assessed: 


• Spontaneous speech: The examiner looks for distinctive
symptoms of aphasia such as word-finding
difficulty or neologisms (e.g., referring to a
comb as a “planker”).

• Repetition of sentences and phrases: The examiner
asks the patient to repeat stimuli such as “No
ifs, ands, or buts,” and “Methodist Episcopal.”
The repetition tasks are so simple that normal
subjects almost never fail them.

• Comprehension of spoken language: The examiner
asks questions (“Does a car have handlebars?”)
and issues commands (“Take this paper,
fold it in half, and put it on the floor”). Again, the
tasks are so simple that normal subjects almost
never fail them.

• Word finding: The examiner points to common,
easily recognized objects and asks, “What’s
this?” Typical items include watch, pen, pencil,
glasses, ring, and shoes. The examiner may
ask the patient to name numbers, letters, or colors.

• Reading: The examiner requests the patient to
read and explain a short paragraph suited to prior
level of education and intelligence. The examiner
may ask the patient to follow written instructions
(e.g., “Close your eyes” or “Clap your hands
three times”).

• Writing and copying: The examiner asks the patient
to write spontaneously and from dictation.
Also, the examiner may ask the patient to copy
written matter and geometric shapes. The examiner
is interested in grossly ungrammatical written
productions and significant distortions in
copying.

• Calculation: The examiner asks the patient to
perform very simple mathematical calculations
(e.g., 17 × 3) with and without aid of scratch
paper. The tasks are so simple that normal subjects
rarely fail.


Based on the clinical assessment, the examiner 
may fill out a rating scale for severity of aphasia. 
For example, the rating scale used in the Boston Di- 
agnostic Aphasia Exam (Goodglass, Kaplan, & 
Barresi, 2000) includes the following speech 
characteristics: melodic line, phrase length, articu- 
latory agility, grammatical form, word finding, and 
auditory comprehension. 


Screening and Comprehensive 
Diagnostic Tests for Aphasia 


Standardized screening tests for aphasia closely 
resemble the brief clinical exam. The essential dif- 
ference is that standardized screening tests in- 
corporate objective and precise instructions for 
administration and scoring. The weakness of 
screening tests is that they will not detect subtle 
forms of aphasia. The stimuli for a widely used 
screening test of aphasia are depicted in Figure 9.8. 

Comprehensive diagnostic tests for aphasia are 
quite lengthy and used mainly when a patient is 
Note: Tasks involve naming, spelling, reading, repeating, and
calculation. 








FIGURE 9.8 Stimulus Figures for the Reitan-Indiana 
Aphasia Screening Test 

Source: Reprinted with permission from Reitan, R. M., & 
Wolfson, D. (1985). The Halstead-Reitan Neuropsychological 
Test Battery: Theory and Clinical Interpretation. Tucson, AZ: 
Neuropsychology Press. The Reitan-Indiana Aphasia Screening 
Test is available from Reitan Neuropsychology Laboratories, 2920 
S. Fourth Ave., South Tucson, AZ 85713. 
TABLE 9.10 Brief Description of Several
Aphasia Tests 


Multilingual Aphasia Examination 

(Benton, Hamsher, Rey, & Sivan, 1994) 
This respected, comprehensive battery consists of 11 
subtests and rating scales that assess visual naming, 
repetition, fluency, articulation, spelling, and other 
language variables; available in a Spanish edition, too. 


Western Aphasia Battery 

(Kertesz, 1979) 
Comprehensive test of verbal fluency, auditory 
comprehension, and repetition that aims to identify 
aphasia syndromes and determine their severity. 


Boston Diagnostic Aphasia Examination 

(Goodglass, Kaplan, & Barresi, 2000) 
Comprehensive test with 46 subscales which include 
music, spatial, computation, and seven types of writing 
skill in addition to traditional aphasia measures;
available in French and Hindi versions, too.


Porch Index of Communicative Ability 

(Porch, 1983) 
A battery containing eighteen 10-item subtests, four 
verbal, eight gestural, and six graphic. Very reliable 
test often used to measure small changes in patient 
performance. 


Token Test 

(Spreen & Strauss, 1998) 
An extremely sensitive test that presents little challenge 
to normal individuals. The examinee must complete 
oral commands with colored tokens, e.g., “Put the 
small red token on top of the large square token.” Orig- 
inally devised by Boller & Vignolo (1966), numerous 
versions of the Token Test are now available. 





known to experience aphasia. These tests provide a 
profile of language skills that is helpful in treatment 
planning. We provide a brief description of several 
aphasia tests in Table 9.10. 


TESTS OF SPATIAL AND 
MANIPULATORY ABILITY 


Tests of spatial and manipulatory ability are also 
known as tests of constructional performance. A 
constructional performance test combines percep- 


tual activity with motor response and always has a 
spatial component (Lezak, 1995). Because con- 
structional ability involves several complex func- 
tions, even mild forms of brain dysfunction will 
result in impaired constructional performance. 
However, careful observation is needed to distin- 
guish the cause of the failed performance, which 
may include spatial confusion, perceptual defi- 
ciency, attentional difficulties, motivational prob- 
lems, and apraxias. The term apraxia refers to a 
variety of dysfunctions characterized by a break- 
down in the direction or execution of complex 
motor acts (Strub & Black, 2000). For example, a 
patient who could not demonstrate how to use a key 
would be diagnosed as suffering from ideomotor 
apraxia. 

Tests of constructional performance embrace 
two large classes of activities: drawing and assem- 
bling. Owing to limitations of space, we will review 
only a few prominent instruments in each category. 


Drawing Tests 


Beyond any doubt, the most widely used drawing 
test is the Bender Visual Motor Gestalt Test, more 
commonly known as the Bender Gestalt Test (BGT; 
Bender, 1938). The BGT consists of nine stimulus 
figures (Figure 9.9); the examinee is instructed to 
copy one at a time on a sheet of blank paper. The ex- 
aminee is told that the BGT “is not a test of artistic 
ability, but try to copy the drawing as accurately as 
possible. Work as fast or as slowly as you wish” 
(Hutt, 1977). 

Several scoring systems have been devised to 
determine whether an examinee’s performance is 
more typical of brain-impaired or non-brain-im- 
paired individuals (Hain, 1964; Hutt & Briskin, 
1960; Lacks, 1999; Pascal & Suttell, 1951; Pauker, 
1976). For adults, the best of these scoring ap- 
proaches is found in Lacks (1999). She identified 12 
qualitative signs scored absent versus present for the 
entire protocol. The presence of any 5 of the signs 
is indicative of brain damage (Table 9.11). Based on 
independent confirmation from other sources of 
information, Lacks reports hit rates of 82 to 86 per- 
cent in a mixed sample of admissions to the acute 

FIGURE 9.9 Stimuli for the Bender Visual Motor Gestalt Test 

Source: Reprinted with permission from Bender, L. (1938). A visual motor gestalt test and its clinical use. 
New York: American Orthopsychiatric Association. Copyright © Lauretta Bender and the American Ortho- 
psychiatric Association. 


psychiatric treatment unit of an urban community entire chapter to this instrument, including interpre- 
mental health center (Lacks & Newport, 1980). Sev- tive guidelines for children and adults. 

eral interesting variations on the BGT are discussed The Greek Cross (Reitan & Wolfson, 1993) is 
in Gregory (1999). Groth-Marnat (1990) devotes an a very simple drawing task that is surprisingly sen- 
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TABLE 9.11 Summary of Diagnostic Signs on the Bender Gestalt 


1. Rotation: Figure is rotated 80 to 180 degrees. 

2. Overlapping difficulty: Problem in drawing 
the portions of a single figure that should 
overlap. 

3. Simplification: Figure is simplified. 

4. Fragmentation: Figure is broken up so that the 
overall gestalt is lost. 

5. Retrogression: Substitution of a more primitive 
gestalt form than the stimulus. 

6. Perseveration: Features of a previous stimulus 
carry over in the current stimulus. 


7. Collision: Two separate figures overlap or collide 
with each other. 
8. Impotence: Numerous erasures and inability to 
finish a drawing to personal satisfaction. 
9. Closure difficulty: Difficulty in getting adjacent 
parts of a figure to touch. 
10. Motor incoordination: Tremor is evident in drawing. 
11. Angulation difficulty: Severe difficulty in reproduc- 
ing the angulation of drawings. 
12. Cohesion: Isolated decrease or increase in size of 
subportion of one drawing. 





Note: A 13th error can be counted if the entire test takes longer than 15 minutes. 
Source: Based on Lacks, P. (1999). Bender-Gestalt screening for brain dysfunction (2nd ed.). New York: Wiley. 


sitive to brain impairment. The examinee is re- 
quested carefully to copy the figure without lifting 
the pencil, that is, by tracing the perimeter. The 
stimulus figure and examples of defective perfor- 
mance are shown in Figure 9.10. This test is most 
often evaluated on a qualitative basis, although 
scoring guides do exist (Swiercinsky, 1978; 
Gregory, 1999). 
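
The decision rule described earlier for the Bender Gestalt (any 5 of the signs in Table 9.11, with a possible 13th error for slow completion) can be written out as a short Python sketch. The sign labels and function names are invented for illustration, and treating the time penalty as counting toward the cutoff of 5 is an assumption of this sketch; Lacks (1999) remains the authority for actual scoring.

```python
# The 12 signs from Table 9.11; the identifiers are labels chosen for this sketch.
SIGNS = (
    "rotation", "overlapping_difficulty", "simplification", "fragmentation",
    "retrogression", "perseveration", "collision", "impotence",
    "closure_difficulty", "motor_incoordination", "angulation_difficulty",
    "cohesion",
)


def count_errors(signs_present, minutes_taken):
    """Count distinct signs observed, plus one error if the test ran over 15 minutes."""
    observed = set(signs_present)
    unknown = observed - set(SIGNS)
    if unknown:
        raise ValueError(f"unrecognized signs: {sorted(unknown)}")
    errors = len(observed)
    if minutes_taken > 15:
        errors += 1  # the 13th error mentioned in the note to Table 9.11
    return errors


def meets_cutoff(signs_present, minutes_taken, cutoff=5):
    """True when the error count reaches the cutoff (5 in the text)."""
    return count_errors(signs_present, minutes_taken) >= cutoff
```

A protocol showing four signs in 10 minutes would fall below the cutoff in this sketch, whereas the same four signs in a 20-minute administration would reach it.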


Assembly Tests 


In his classic book on the parietal lobes, Critchley 
(1953) provided the rationale for including three- 
dimensional construction tasks in a neuropsycho- 
logical test battery: 


It is possible, and indeed useful, to proceed to
problems in three-dimensional space though tests
of this character are only too rarely employed. This
is a more difficult undertaking, and patients who
respond moderately well to the usual procedures
with sticks and pencil-and-paper may display gross
abnormalities when told to assemble bricks accord-
ing to a three-dimensional pattern.


FIGURE 9.10 The Greek Cross Stimulus Figure and Reproductions from Persons with Known Brain Damage

(a) Stimulus figure.

(b) Clerical worker with diffuse right hemisphere dysfunction of unknown origin.

(c) College professor two years after a right hemisphere stroke.

(d) Patient with generalized, diffuse dementia.

Source: From Gregory, Robert J. Foundations of intellectual assessment: The WAIS-III and
other tests in clinical practice, p. 197. Published by Allyn and Bacon, Boston, MA.
Copyright © 1999 by Pearson Education. Adapted by permission of the publisher.


Benton, Sivan, Hamsher, Varney, and Spreen 
(1994) present a three-dimensional block construction
test with excellent norms and a scoring guide.
The two forms of the test (Form A and Form B) 
consist of three block models that are presented one 
at a time to the patient. The patient is requested to 
construct an exact replica of the model by selecting 
the appropriate blocks from a set of loose blocks on 
a tray. Based on omissions, additions, substitutions, 
and displacements, the three models are scored 
from 0 to 6, 8, and 15 points, respectively. This test 
is quite sensitive to brain impairment, especially 
when the left or right parietal area is affected. 
Lezak (1995) discusses other assembly tasks. We 
should mention that the Tactual Performance Test 
from the Halstead-Reitan battery is, in part, an as- 
sembly task that measures spatial and manipulatory 
abilities (see Table 9.7). 


ASSESSMENT OF 
EXECUTIVE FUNCTIONS 


Executive functions include logical analysis, con- 
ceptualization, reasoning, planning, and flexibility 
of thinking. The assessment of executive functions 
presents an unusual quandary to neuropsychologists: 


A major obstacle to examining the executive func- 
tions is the paradoxical need to structure a situation 
in which patients can show whether and how well 
they can make structure for themselves. Typically 
in formal examinations, the examiner determines 
what activity the subject is to do with what materi- 
als, when, where, and how. Most cognitive tests, 
for example, allow the subject little room for dis- 
cretionary behavior, including many tests thought 
to be sensitive to executive—or frontal lobe— 
disorders. . . . The problem for clinicians who want
to examine the executive functions becomes how to 
transfer goal setting, structuring, and decision mak- 
ing from the clinician to the subject within the 
structured examination. (Lezak, 1995) 


Many neuropsychologists resolve this quandary 
by using the clinical method to evaluate executive 
functions rather than administering formal tests 
(Cripe, 1996). For example, Pollens, McBratnie, 
and Burton (1988) use interview and observations to 
fill out the structured checklist on executive func- 
tions mentioned in the previous topic. 

Only a limited number of neuropsychological 
tests tap executive functions to any appreciable de- 
gree. Useful instruments in this regard include the 
Porteus Mazes, Wisconsin Card Sorting Test, and a 
novel approach known as the Tinkertoy® Test. We 
remind the reader that the Category Test from the 
Halstead-Reitan battery also captures executive 
functions to some extent (Table 9.7). 

The Porteus Maze Test was devised as a culture- 
reduced measure of planning and foresight (Porteus, 
1965). Without lifting the pencil and attempting to 
avoid dead ends, the examinee must trace a line 
through a series of increasingly difficult mazes. This 
underused instrument is quite sensitive to the effects 
of brain damage, particularly in the frontal lobes 
(Smith & Kinder, 1959; Smith, 1960; Tow, 1955). 

Krikorian and Bartok (1998) have published 
contemporary Porteus Maze norms for children and 
young adults 7 to 21 years of age; these researchers 
also demonstrated that test scores are minimally 
related to IQ scores. Mack and Patterson (1995) 
investigated the Porteus test as a useful measure 
of executive function in elderly patients with Alz- 
heimer’s disease. In a study of 276 pediatric pa- 
tients who had sustained a traumatic brain injury 
(TBI), Levin, Song, Ewing-Cobbs, and Roberson 
(2001) found that the Porteus test was highly sen- 
sitive to TBI severity as measured by the volume of 
tissue damage in the prefrontal areas of the brain. 

The Wisconsin Card Sorting Test (WCST) is a 
good measure of executive functions, although its 
differential sensitivity to frontal lobe damage is de- 
bated (Mountain & Snow, 1993). The instrument 
was devised to study abstract thinking and the abil- 
ity to shift set (Berg, 1948; Heaton, Chelune, Talley, 
and others, 1993). The examinee is given a pack of 
64 cards on which are printed one to four symbols 
(triangle, star, cross, or circle) in one of four colors 
(red, green, yellow, or blue). No two cards are identical.
Thus, each card embodies a number, a particular
shape, and a specific color. The examinee must
sort these cards underneath four stimulus cards ac-
cording to an unknown principle (Figure 9.11). For
example, the unknown principle might be “sort
according to color.” As the examinee places cards,
the examiner says “right” or “wrong.” After the ex- 
aminee has sorted a run of 10 correct placements in 
a row, the examiner shifts the principle without 
warning. The test continues until the examinee has 
made six runs of 10 correct placements. The test can 
be scored in several different ways, including total 
number of trials to criterion (Axelrod, Greve, & 
Goldman, 1994). A common use of the WCST is to 
gauge ongoing recovery in patients with brain 
trauma of recent onset. Thus, the longitudinal con- 
stancy of test scores in patients with stabilized con- 
ditions is a reassuring characteristic of this test 
(Greve, Love, Sherwin, and others, 2002). 
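
The administration logic just described (feedback after every card, an unannounced shift after 10 consecutive correct placements, and termination after six such runs) can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the four stimulus cards, the order in which principles rotate, and the `sort_by` examinee are hypothetical choices made for the example and are not taken from the test manual.

```python
from itertools import product

NUMBERS = (1, 2, 3, 4)
SHAPES = ("triangle", "star", "cross", "circle")
COLORS = ("red", "green", "yellow", "blue")

# 4 numbers x 4 shapes x 4 colors = 64 unique response cards.
DECK = [
    {"number": n, "shape": s, "color": c}
    for n, s, c in product(NUMBERS, SHAPES, COLORS)
]

# Assumed stimulus cards: no two share a number, shape, or color.
STIMULI = [
    {"number": 1, "shape": "triangle", "color": "red"},
    {"number": 2, "shape": "star", "color": "green"},
    {"number": 3, "shape": "cross", "color": "yellow"},
    {"number": 4, "shape": "circle", "color": "blue"},
]


def administer(cards, choose_pile, principles=("color", "shape", "number"),
               run_length=10, runs_needed=6):
    """Simulate the examiner; return (trials_used, runs_completed).

    choose_pile(card, feedback) returns the index of a stimulus pile;
    feedback is "right", "wrong", or None before the first placement.
    The default order of principles is an assumption of this sketch.
    """
    streak = runs = trials = 0
    feedback = None
    for card in cards:
        if runs == runs_needed:
            break
        # The principle shifts, unannounced, after each completed run.
        principle = principles[runs % len(principles)]
        pile = choose_pile(card, feedback)
        trials += 1
        if STIMULI[pile][principle] == card[principle]:
            feedback, streak = "right", streak + 1
        else:
            feedback, streak = "wrong", 0
        if streak == run_length:
            runs, streak = runs + 1, 0
    return trials, runs


def sort_by(dimension):
    """Hypothetical examinee who always sorts on one fixed dimension."""
    def choose_pile(card, feedback):
        return next(i for i, s in enumerate(STIMULI)
                    if s[dimension] == card[dimension])
    return choose_pile
```

With the principle held at color, `administer(DECK, sort_by("color"), principles=("color",))` reaches criterion in the minimum of 60 trials, whereas an examinee who perseverates on shape exhausts all 64 cards without completing a single run. The latter pattern is the kind of failure to shift set that the test is designed to expose.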

Lezak (1982) devised the Tinkertoy® Test to give 
patients the opportunity to demonstrate executive 
capacities within the structured format of an exami- 
nation. Fifty pieces of a standard Tinkertoy® set are 
placed on a clean surface and the examinee is told, 
“Make whatever you want with these. You will have 
at least five minutes and as much more time as you 
wish to make something.” The test is scored from -1


to +12 based on several variables including the num- 
ber of pieces used, the mobility of the construction, 
symmetry, and the naming of the construction. 
Head-injured patients produce impoverished de- 
signs consisting of a small number of pieces. These 
individuals often are unable to provide a name for 
their constructions. 

Bayless, Varney, and Roberts (1989) studied the 
predictive validity of the Tinkertoy® Test by com- 
paring the results of 50 patients with closed-head 
injuries versus 25 normal controls. Half of the 
head-injured individuals had returned to work 
while half had not. Whereas all but one of the head- 
injured who returned to work scored normally on 
the Tinkertoy® Test, nearly half of the nonreturnees 
performed below the level of the worst control sub- 
ject. The researchers conclude: 


The test seems particularly well suited for demon- 
strating the presence of deficits in executive func- 
tioning, which have proven to be difficult to 
demonstrate with clinical tests even though they 
have catastrophic sequelae in daily vocational or 
psychosocial endeavors. (Bayless et al., 1989) 


The Tinkertoy® Test also shows promise in the as- 
sessment of individuals with Alzheimer’s disease 
(Koss, Patterson, Mack, Smyth, & Whitehouse, 1998). 
FIGURE 9.11 The Cards and Sorting Piles for the Wisconsin Card Sorting Test
Source: Reproduced by special permission from Psychological Assessment Resources, Inc. All rights
reserved. Copyright © 1981 by Psychological Assessment Resources, Inc.

Neuropsychologists still need additional measures
of executive functions. One promising approach
in the early stages of development is
real-world assessment of route finding. The ability 
to find an unfamiliar location in the city requires 
strategy, self-monitoring, and corrective maneu- 
vers. These are executive functions applied to a re- 
alistic problem (Boyd & Sauter, 1993). Another 
promising approach is embodied in a recent battery 
called the Behavioral Assessment of the Dysexecu- 
tive System (Wilson, Alderman, Burgess, and oth- 
ers, 1996). The BADS battery consists of six new 
tests that are similar to real-life activities: 


1. Rule shifting with playing cards 

2. Problem solving with a water-filled beaker 

3. Simulated search for a lost key 

4. Judgment of time needed for activities, events 

5. Route finding with a map 

6. Organization task involving dictation, arith- 
metic, naming 


The battery also includes a 20-item questionnaire 
rated on a 5-point (0 to 4) Likert scale. The items in- 
volve likely changes when executive functions are 
impaired, for example, “I have difficulty thinking 
ahead and planning for the future.” Spreen and 
Strauss (1998) provide a helpful review of this 
promising test battery. Recently Norris and Tate 
(2000) compared the BADS with six other com- 
monly used tests of executive functioning. In a sam- 
ple of 36 neurological patients, they demonstrated 
the ecological superiority of this new instrument in 
predicting competency in everyday role functioning. 


ASSESSMENT OF MOTOR OUTPUT


Most neuropsychological test batteries include 
measures of manipulative speed and accuracy. 
Lezak (1995) provides a comprehensive review. We 
will briefly summarize three approaches: finger 
tapping, pegboard performance, and line tracing. 
Perhaps the most widely used test of motor dex- 
terity is the Finger-Tapping Test from the Halstead- 
Reitan battery. This test consists of a tapping key 
that extends from a mechanical counting device at- 
tached to a flat board. With the index finger of each 
hand, the examinee completes a series of 10-sec- 


ond trials until five trials in a row are within a 5- 
point range. The score for each hand is the average 
of these five trials, rounded to the nearest whole 
number. With the dominant hand, males typically 
score about 54 taps (SD of 4), whereas females typ- 
ically score about 51 taps (SD of 5; Dodrill, 1979; 
Morrison, Gregory, & Paul, 1979). 

In general, the absolute level of performance is 
of less interest than the relative abilities on the two 
sides of the body. Normative expectation is that the 
nondominant hand will yield a tapping rate about 
90 percent of the dominant hand. Significant devi- 
ations from this pattern are thought to indicate a le- 
sion in the hemisphere opposite that of the slowed 
hand (Haaland & Delaney, 1981). However, such 
inferences must be made with great caution owing 
to the very low reliability of the ratio score. Al- 
though test-retest and interexaminer reliabilities for 
either hand alone approach .80, the reliability of the 
ratio score is a dismal .44 to .54 (Morrison, Gregory,
& Paul, 1979). The ratio score should be used
with extreme caution in making clinical inferences 
about lateralization of damage. 
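
The finger-tapping scoring rule and the dominant/nondominant comparison can be made concrete with a brief Python sketch. Two assumptions are made here and flagged in the comments: "within a 5-point range" is read as a spread (highest minus lowest) of no more than 5 taps, and the first qualifying run of five trials is the one that is scored. The function names are illustrative only.

```python
import math


def tapping_score(trials, window=5, max_range=5):
    """Rounded mean of the first `window` consecutive trials whose spread
    (max minus min) is at most `max_range`; None if no such run exists.

    Reading "within a 5-point range" as spread <= 5 is an assumption.
    """
    for start in range(len(trials) - window + 1):
        run = trials[start:start + window]
        if max(run) - min(run) <= max_range:
            # Round half up to the nearest whole number of taps.
            return math.floor(sum(run) / window + 0.5)
    return None


def nondominant_ratio(dominant_score, nondominant_score):
    """Nondominant tapping rate as a proportion of the dominant rate."""
    return nondominant_score / dominant_score
```

For the trial series 50, 58, 54, 55, 53, 56, 52, the first five trials span 8 points and do not qualify, but trials two through six span 5 points and yield a score of 55. A dominant score of 50 with a nondominant score of 45 gives a ratio of .90, the normative expectation cited above. The sketch deliberately sets no cutoff for a "significant" deviation, because the text gives none and the low reliability of the ratio argues against a mechanical rule.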

The Purdue Pegboard Test requires the exami- 
nee to place pegs in holes with the left hand, right 
hand, and then both hands. Each trial lasts only 30 
seconds, so the entire test can be administered in a 
matter of minutes. Tiffin (1968) reports normative 
scores for work applicants. Relative slowing in one 
hand suggests a lesion in the opposite hemisphere, 
whereas bilateral slowing indicates diffuse or bi- 
lateral brain damage. Using the Purdue Pegboard 
Test in isolation, one study found an 80 percent ac- 
curacy in identifying brain impairment among a 
large group of normal subjects and neurological pa- 
tients (Lezak, 1983). Other studies report much less 
favorable findings (Heaton, Smith, Lehman, & 
Vogt, 1978). The Purdue Pegboard Test is a useful 
addition to a comprehensive battery but should not 
be used in isolation for screening purposes. Spreen 
and Strauss (1998) provide an excellent summary 
of norms for this widely used test. 

Klove has developed a variation on the peg- 
board test in which the pegs have a ridge along one 
side (Klove, 1963). Because each peg must be ro- 
tated into position, the Grooved Pegboard requires 
complex coordination in addition to motor dexterity.
The Grooved Pegboard test is an excellent instrument
for assessing lateralized brain damage
(Haaland & Delaney, 1981).

FIGURE 9.12 A Typical Line-Tracing Task (Reduced Size)

Finally, we should mention that useful motor 
tests need not require sophisticated equipment. 
Lezak (1995) recommends a line tracing task to as- 
sess difficulties in motor regulation (Figure 9.12). 
The examinee is given a brightly colored felt-tipped 
pen and a sheet of paper with several figures and told 
to draw over the lines as rapidly as possible. Diffi- 
culties with motor regulation show up in overshoot- 
ing corners, perseveration of an ongoing response, 
and inability to follow the reduced curves in the bot- 
tom figure. Because this task is easily completed by 
most 10-year-olds, any noticeable deviations are 
suggestive of difficulties in motor regulation. 


TEST BATTERIES IN 
NEUROPSYCHOLOGICAL ASSESSMENT 


Now that we have completed a tour of some indi- 
vidual neuropsychological tests and procedures, it 
is time once again to remind the reader that many 
neuropsychologists prefer to use a fixed battery 
rather than an ever-shifting, individualized assort- 
ment of instruments. Certainly, one of the most 
widely used fixed batteries is the Luria-Nebraska 
Neuropsychological Battery (LNNB; Golden, 
1989; Golden, Purish, & Hammeke, 1980, 1986), 
now in its third edition (LNNB-III; Teichner, 
Golden, Bradley, & Crum, 1999). 


The test consists of 269 discrete items, chosen 
from the work of Luria (e.g., 1966) and formally stan- 
dardized. These items are scored 0, 1, or 2 according 
to precise criteria in the administration and scoring 
manual. Similar items are grouped together into 11 
clinical scales, C1 through C11 (Table 9.12). Raw 
scores on each scale are converted into T scores, with 
a mean of 50 and a standard deviation of 10. Higher 
scores reflect more psychopathology; scores above 
70 are especially suggestive of brain impairment. 
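
The T-score metric mentioned above follows the general linear definition, T = 50 + 10z, where z is the raw score expressed in standard deviation units relative to the normative group. The Python sketch below applies that general formula. The normative mean and standard deviation in the example are hypothetical, and the LNNB manual's own conversion tables are not reproduced here.

```python
def t_score(raw, norm_mean, norm_sd):
    """Linear T score: mean 50, standard deviation 10 in the normative group."""
    if norm_sd <= 0:
        raise ValueError("normative standard deviation must be positive")
    z = (raw - norm_mean) / norm_sd
    return 50 + 10 * z


# Hypothetical norms for one scale: mean raw score 20, standard deviation 5.
example = t_score(30, norm_mean=20, norm_sd=5)  # two SDs above the mean
```

Under these made-up norms, a raw score of 30 lies two standard deviations above the mean and converts to a T score of 70, the level the text describes as especially suggestive of brain impairment.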

Three summary scales are also derived from 
test performance: S1 (Pathognomonic), S2 (Left 
Hemisphere), and S3 (Right Hemisphere). The 
Pathognomonic scale reflects the degree of com- 
pensation that has occurred since an injury, such as 
functional reorganization of the brain as well as 
actual physical recovery. Higher scores reflect less
compensation. The Left Hemisphere and Right
Hemisphere scales can be used to help determine
whether an injury is diffuse or lateralized. A num-
ber of other scales and interpretive factors are also
available (Golden, Purish, & Hammeke, 1986). The
use of the LNNB is illustrated in Case Exhibit 9.2.


TABLE 9.12 Tests and Procedures of the Luria-Nebraska Neuropsychological Battery

Ability Scale: Tasks Included

C1 Motor: Coordination, speed, drawing, complex
motor abilities

C2 Rhythm: Attend to, discriminate, and produce ver-
bal and nonverbal rhythmic stimuli

C3 Tactile: Identify tactile stimuli, including stimuli
traced on the wrists

C4 Visual: Identify drawings, including overlapping
and unfocused objects; solve progressive matrices and
other visuospatial skills

C5 Receptive Speech: Discriminate phonemes and
comprehend words, phrases, sentences

C6 Expressive Speech: Articulate sounds, words,
and sentences fluently; identify pictured or described
objects

C7 Writing: Use motor writing abilities in general;
copy and write from dictation

C8 Reading: Read letters, words, and sentences; syn-
thesize letters into sounds and words

C9 Arithmetic: Complete simple mathematical compu-
tations; comprehend mathematical signs and number
structure

C10 Memory: Remember verbal and nonverbal stimuli
under both interference and noninterference conditions

C11 Intelligence: Reasoning, concept formation, and
complex mathematical problem solving


We cannot review the voluminous literature on 
the LNNB, but brief mention of a few key studies 
certainly is merited. The reliability of the LNNB 
has been evaluated from the usual perspectives 
(split-half, internal consistency, and test-retest), 
with excellent results. For example, the mean test-
retest reliability for the clinical scales was near .90
(Bach, Harowski, Kirby, Peterson, & Schulein, 
1981; Plaisted & Golden, 1982; Teichner et al., 
1999). In various validity studies of classification 
of brain-damaged persons versus other criterion 
groups, the LNNB has shown hit rates of 80 percent 
or better (Golden, Moses, Graber, & Berg, 1981; 
Hammeke, Golden, & Purish, 1978; Moses & 
Golden, 1979; Teichner et al., 1999). 
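
A hit rate of the kind reported in these validity studies is simply the proportion of all cases that the test classifies correctly. The following Python sketch shows the arithmetic; the counts in the example are invented for illustration and do not come from any of the studies cited.

```python
def hit_rate(true_pos, true_neg, false_pos, false_neg):
    """Proportion of all cases classified correctly by the test."""
    total = true_pos + true_neg + false_pos + false_neg
    if total == 0:
        raise ValueError("no cases to classify")
    return (true_pos + true_neg) / total


# Invented counts: 50 brain-damaged patients (42 detected, 8 missed) and
# 50 comparison cases (40 correctly cleared, 10 falsely flagged).
example = hit_rate(true_pos=42, true_neg=40, false_pos=10, false_neg=8)
```

With these invented counts the hit rate is .82. A single hit rate does not show how the errors divide between false positives and false negatives, so two tests with the same hit rate can have quite different practical consequences.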

In spite of the positive appraisals of the LNNB 
reported by Golden and his colleagues, some neu- 
ropsychologists remain skeptical of the test (e.g., 
Lezak, 1995). One concern is that the heterogene- 
ity of the scales is so great that the individual scale 
scores do not quantify specific neuropsychological 
deficits but instead serve only to differentiate nor- 
mal persons from brain-damaged patients (Snow, 
1992; Van Gorp, 1992). Early reviewers also ex- 
pressed concern that the speech scales were not ori- 
ented to syndromes of aphasia and could therefore 
misdiagnose language deficits (Delis & Kaplan, 
1982). In defense of the LNNB, Purish (2001) 
contends that initial criticisms were based on mis- 
conceptions as to the theoretical basis for the in- 
strument. Furthermore, in his view, these criticisms 
have been largely negated by an expanding body of 
empirical research supporting the test. 


ASSESSMENT OF MENTAL 
STATUS IN THE ELDERLY 


The mental status examination (MSE) is a loosely 
structured interview that usually precedes other 
forms of psychological and medical assessment. 
The purpose of the evaluation is to provide an ac- 
curate description of the patient’s functioning in the 
realms of orientation, memory, thought, feeling, 
and judgment. The MSE is the psychological 
equivalent of the general physical examination: 
Just as the physician reviews all the major organ 
systems, looking for evidence of disease, the psy- 
chologist reviews the major categories of personal 
and intellectual functioning, looking for signs and 
symptoms of psychopathology (Gregory, 1999). 
Although there is some latitude as to the scope of 


the MSE, certain mental functions are almost al- 
ways investigated. A typical evaluation touches 
upon the areas listed in Table 9.13. 

Some of the elements in this list can be assessed 
with short screening tests. In particular, cognition, 
memory, and orientation are intellectual functions 
that can be tested in a formal, structured manner 
(Hodges, 1994). In this section, we review several 
brief measures of mental status used by clinicians 
to supplement interview impressions. These mea- 
sures are most commonly used in the mental status 


TABLE 9.13 Major Areas of a Typical Mental Status Exam


Appearance and Behavior
  Grooming
  Facial expressions
  Gross motor behavior
  Eye contact

Speech and Communication Processes
  Speech content, rate, tone, volume
  Word difficulty, confusion, misuse

Thought Content
  Logic, clarity, appropriateness
  Delusions

Cognitive and Memory Functioning
  Calculating ability
  Immediate recall
  Recent and remote memory
  Fund of information
  Abstracting ability

Emotional Functioning
  Predominant mood
  Appropriateness of affect

Insight and Judgment
  Awareness of problems

Orientation
  Day, date, time, location


Source: Based on Gregory, R. J. (1999). Foundations of intellectual
assessment: The WAIS-III and other tests in clinical practice.
Boston: Allyn and Bacon.
evaluation of the elderly, especially when the client
appears to have a dementia such as Alzheimer’s 
disease. Formal tests of mental status are also help- 
ful in the assessment of certain brain-impairing 
conditions such as head injury, schizophrenia, se- 
vere depression, and drug-induced delirium. It is 
important to emphasize that screening tests are sup- 
plementary—they do not replace clinical judgment 
in the evaluation of mental status. Some areas cov- 
ered by the MSE are simply impossible to quantify. 
For example, the evaluation of a patient’s insight 
requires keen observation and sensitive interview- 
ing skills. An MSE screening test for insight does 
not exist. 


Mini-Mental State Exam 


The most widely used mental status tool is the 
Mini-Mental State Examination (MMSE), a 5- 
to 10-minute screening test that yields an objec- 
tive global index of cognitive functioning (Fol- 
stein, Folstein, & McHugh, 1975; Tombaugh, 
McDowell, Kristjansson, & Hubley, 1996). The test 
contains 30 scorable items having to do with 
orientation, immediate memory, attention, calcula- 
tion, language production, language comprehen- 
sion, and design copying. The items are so easy that 
normal adults almost always obtain scores in the 
range of 27 to 30 points (Figure 9.13). 

The reliability of this simple instrument is ex- 
cellent. Folstein et al. (1975) report a 24-hour test- 
retest reliability of .89 for 22 patients with varied 
depressive symptoms. Reliability over a 28-day pe- 
riod for 23 clinically stable patients with diagnoses 
of dementia, depression, and schizophrenia was an 
impressive .99. Normative data are available from 
several sources (e.g., Lindal & Stefansson, 1993; 
Tombaugh, McDowell, Kristjansson, & Hubley, 1996). 

Using a cutting score of 23 or below as abnor- 
mal and 24 or above as normal, the MMSE is about 
80 to 90 percent accurate in identifying elderly pa- 
tients with suspected Alzheimer’s disease or other 
dementia. This cutting score produces few false 
positives (normal patients classified as having de- 
mentia). The sensitivity of the instrument depends 





Points  Domain

 5  Orientation to Time (day, date, month, season, and year)
 5  Orientation to Place (floor, building, city area, city, state)
 3  Immediate Memory (three words presented orally)
 5  Attention and Calculation (serial 7s, five subtractions)
 3  Delayed Recall (three words presented orally above)
 2  Naming (pencil and watch)
 1  Repetition (brief sentence presented orally)
 3  Comprehension (follow simple three-part oral command)
 1  Reading (read simple command and obey)
 1  Writing (compose a simple sentence)
 1  Drawing (reproduce two intersecting pentagons)

30  Total


FIGURE 9.13 Scoring Weights and Domains of the Mini-Mental State Examination


upon a number of factors, including the cutting 
score used, the educational level of the examinee, 
the extent of the dementia, the nature of the under- 
lying pathology, and the type of setting in which as- 
sessments are undertaken (Anthony, LeResche, 
Niaz, Von Korff, & Folstein, 1982; Tombaugh, 
McDowell, Kristjansson, & Hubley, 1996; Tsai & 
Tsuang, 1979). In spite of its limitations, the 
MMSE remains the most reliable and practical 
screening test for dementia in the elderly (Ferris, 
1992). Drebing, Van Gorp, Stuck, and others 
(1994) recommend its use as part of a short screen- 
ing battery for cognitive decline in the elderly. Sev- 
eral additional measures of geriatric mental status 
are outlined in Table 9.14. 
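
The scoring weights in Figure 9.13 and the cutting score discussed above can be combined in a short Python sketch. The domain labels and function names are invented for the example, and the cutoff is left as a parameter because, as noted, accuracy depends on the cutting score and on characteristics of the examinee and setting. The sketch illustrates the arithmetic only and is no substitute for clinical judgment.

```python
# Maximum points per domain, taken from Figure 9.13.
MMSE_WEIGHTS = {
    "orientation_time": 5,
    "orientation_place": 5,
    "immediate_memory": 3,
    "attention_calculation": 5,
    "delayed_recall": 3,
    "naming": 2,
    "repetition": 1,
    "comprehension": 3,
    "reading": 1,
    "writing": 1,
    "drawing": 1,
}


def mmse_total(points):
    """Sum the points earned per domain, checking each against its maximum."""
    total = 0
    for domain, maximum in MMSE_WEIGHTS.items():
        earned = points[domain]
        if not 0 <= earned <= maximum:
            raise ValueError(f"{domain}: expected 0 to {maximum}, got {earned}")
        total += earned
    return total


def classify(total, cutoff=23):
    """Scores at or below the cutoff are treated as abnormal (23 in the text)."""
    return "abnormal" if total <= cutoff else "normal"
```

An examinee who earns full credit everywhere except delayed recall (0 of 3) and attention and calculation (1 of 5) would total 23 points and fall on the abnormal side of the cutting score; one additional point would move the classification to normal.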
TABLE 9.14 Mental Status Tests Used with Geriatric Patients


Test: Content


Cognistat (Kiernan, Mueller, and Langston, 1997): Language, construction (copying),
memory, calculation, and reasoning/judgment

Information-Memory-Concentration (Blessed, Tomlinson, and Roth, 1968): Information,
orientation, concentration

Short Portable Mental Status Questionnaire (Pfeiffer, 1975): Information, orientation,
attention

Dementia Rating Scale (Mattis, 2001): Attention, memory, construction (copying),
conceptualization, verbal fluency

Test of Temporal Orientations (Benton, Sivan, Hamsher, Varney, and Spreen, 1994):
Orientation

Alzheimer’s Disease Assessment Scale (Rosen, Mohs, and Davis, 1984): Orientation,
memory, language, construction (copying)

Cambridge Cognitive Examination (Roth, Tym, Mountjoy, and others, 1986): Orientation,
memory, language, construction (copying), attention, abstraction, perception, calculation

Severe Impairment Battery (Saxton, McGonigle-Gibson, Swihart, and others, 1990):
Orientation, memory, language, attention, social interaction, construction (copying),
praxis, visuo-perception





SUMMARY 


1. For purposes of assessment, cognitive pro- 
cessing is viewed as proceeding sequentially 
through the following stages: sensory input, atten- 
tion and concentration, learning and memory, lan- 
guage skills and/or visual-spatial/manipulatory 
skills, executive functions, and motor output (see 
Figure 9.6). 


2. The assessment of sensory input is typi- 
cally achieved through unilateral and bilateral 
stimulation in the modalities of touch, hearing, and 
vision. Typical tasks (e.g., finger localization) 
are so simple that normal persons rarely make 
errors. 


3. Measures of attentional impairment in- 
clude subtracting serial sevens; the Continuous 
Performance Test, a family of computerized vigi- 
lance tasks; and the Paced Auditory Serial Addi- 
tion Test, a speeded test of mental arithmetic 
(adding successive pairs of digits), which is very 
sensitive to the effects of concussion. 


4. A respected memory test is the Wechsler 
Memory Scale-III, a substantial revision of the 
original scale published nearly 50 years ago. Care- 
fully standardized, the WMS-III consists of 17 
subtests, including some with surprise recall a half 
hour after the original administration. 


5. Another widely used memory test is the 
Rey Auditory Verbal Learning Test (RAVLT) in 
which the same list of 15 concrete nouns is read to 
the examinee for five successive trials. Recall is 
tested after each trial and also after an interpolated 
word list is administered. 

6. Aphasia is any deviation in language per- 
formance caused by brain damage. Tests of apha- 
sia (e.g., Reitan’s Aphasia Screening Test or the 
Boston Diagnostic Aphasia Exam by Goodglass 
and Kaplan) typically assess spontaneous speech, 
repetition of sentences and phrases, comprehen- 
sion of spoken language, word finding, reading, 
writing, copying, and calculation. 
7. Tests of spatial and manipulatory ability
include drawing or copying tests such as the 
Bender Visual Motor Gestalt Test, and three- 
dimensional block construction tests; both types 
are sensitive to the effects of brain damage. 


8. Executive functions include logical analy- 
sis, conceptualization, reasoning, planning, and 
flexibility of thinking. Useful tests for the assess- 
ment of executive functions include the Porteus 
Maze Test; the Wisconsin Card Sorting Test; and 
the Tinkertoy® Test, so-named because of the ma- 
terials used. 


9. Neuropsychological test batteries com- 
monly include measures of motor output such as 
the Finger Tapping Test from the Halstead-Reitan 
battery. Typically, the nondominant hand is 10 per- 
cent slower than the dominant hand; deviations 
from this pattern may indicate a lesion in the hemi- 
sphere opposite that of the slowed hand. 


10. Other useful motor tests include the Pur- 
due Pegboard Test, which requires the examinee to 


place pegs in holes with the left hand, right hand, 
and then both hands; and simple line-tracing tasks 
easily completed by most 10-year-olds. 


11. The Luria-Nebraska Neuropsychological 
Battery consists of 269 discrete items modeled 
upon the work of Luria and formally standardized. 
The test developer and his colleagues report ex- 
cellent reliability and strong validity (e.g., hit rates 
of 80 percent or better in identification of brain- 
damaged subjects). 


12. The mental status examination (MSE) is a 
loosely structured interview that usually precedes 
other forms of psychological and medical assess- 
ment. Areas assessed in the MSE include orienta- 
tion, memory, thought, feeling, and judgment. 


13. A mental status screening test that is par-
ticularly useful with the elderly is the Mini-Mental
State Examination. This 30-item test has high
reliability and, in some populations, hit rates of 80
to 90 percent for the detection of dementia.


KEY TERMS AND CONCEPTS


concussion p. 333 
Alzheimer’s disease p. 337 
aphasia p. 337 


apraxia p. 339 
executive functions p. 342 
Assessment of Emotional and Behavioral Disorders
In this chapter, we explore the applications of
testing within two distinctive environments— 
schools and the legal system. Although disparate in 
many respects, these two settings for assessment 
share some essential features. In both arenas, legal 
guidelines exert a powerful and constraining influ- 
ence upon the practice of testing. Issues of proper 
diagnosis and classification are especially perti- 
nent. For example, in assessing a student for learn- 
ing disability, the school psychologist observes 
federal guidelines that circumscribe the definition 
of these disorders. Similarly, in evaluating a client 
for competency to stand trial, the forensic psychol- 
ogist follows published guidelines that determine 
the boundaries of legal fitness. In both cases, the
practice of assessment is shaped, informed, and 
guided by legal precepts. 

School-based assessment and forensic assess- 
ment certainly comprise two of the more interest- 
ing arenas for psychological testing. In Topic 10A, 
School-Based Assessment, we consider the use of 
psychological tests within schools for purposes 
such as screening, assessment, and special place- 
ment. In Topic 10B, Forensic Applications of 
Assessment, we analyze the unique challenges en- 
countered by forensic psychologists who perform 
court-based evaluations. Of course, relevant tests 
are surveyed and catalogued. But more important, 
we focus upon the special issues and challenges
encountered within these distinctive milieus.

While it is true that a major application of 
school-based testing is evaluation for learning dis- 
abilities (discussed at length in the following), psy- 
chologists provide many other assessment services 
for schools. These include screening for school 
readiness, evaluation of attention deficit and other 
behavioral problems, and testing for giftedness. A 
survey of these topics is provided in the following 
sections. Additional school-based assessment prac- 
tices such as evaluation for mental retardation and 
assessment of children with disabilities are de- 
scribed in Topic 7A, Testing Special Populations. 


SCREENING FOR SCHOOL READINESS


The purpose of screening is to identify at-risk chil- 
dren so they can be referred for more comprehen- 
sive evaluation (Kamphaus, 1993). But “at-risk” for 
what? The general answer refers to likelihood of 
failure in the early elementary years of schooling. 
The notion of being at-risk is intimately linked to the 
concept of developmental delay, which refers to 
children whose cognitive development is well below 
age expectations. Some children identified with this 
label “catch up” later in life. For these children, de- 
velopmental delay is an appropriate designation. 
Certainly it is a more optimistic and less-stigmatiz- 
ing label than mental retardation—which is often 
the ultimate outcome of developmental delay. 

Children with low intelligence are substantially 
at-risk for school failure, which explains why indi- 
vidual intelligence tests play an important role in 
the evaluation of preschool children. But individual 
intelligence tests require a substantial commitment 
of time (up to two hours) and must be administered 
by carefully trained practitioners. For practical rea- 
sons, then, individual intelligence tests are not suit- 
able as screening instruments. 

The ideal screening instrument is a short test 
that can be administered by teachers, school nurses, 
and other individuals who have received limited 
training in assessment. In addition, a sensible 
screening test is one that provides a cutoff score 
that is accurate in classifying children as normal or 
at-risk. In the context of screening tests, two kinds 


of errors can occur. Normal children who fail the 
test would be referred to as false-positive cases (be- 
cause they are falsely classified as positive for po- 
tential disability). At-risk children who pass the test 
would be referred to as false-negative cases (be- 
cause they are falsely classified as negative for po- 
tential disability). The reader must keep in mind 
that the purpose of screening is merely to identify 
children in need of additional evaluation, which 
means that false-positive cases will receive further
evaluation. Hence a false-positive misclassification 
rarely leads to undesirable consequences. However, 
false-negative cases typically do not receive further 
evaluation, so this kind of misclassification is po- 
tentially more serious—because a needy child is 
deemed to be normal. Glascoe (1991) recommends 
that a useful instrument should yield a false-nega- 
tive rate of less than 20 percent (meaning that 80 
percent of truly at-risk children are flagged by the 
test) and an even lower false-positive rate of less 
than 10 percent (meaning that 90 percent of normal 
children pass the test). 
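
To make the two error rates concrete, the following Python sketch computes them from a tally of screening outcomes. The counts and the function names are hypothetical illustrations, not data from any study; the 20 percent and 10 percent limits are the Glascoe (1991) recommendations described above.

```python
def screening_error_rates(true_pos, false_neg, true_neg, false_pos):
    """Return (false_negative_rate, false_positive_rate) for a screening test.

    false_negative_rate: share of truly at-risk children who pass the test.
    false_positive_rate: share of normal children who fail the test.
    """
    at_risk = true_pos + false_neg
    normal = true_neg + false_pos
    return false_neg / at_risk, false_pos / normal


def meets_glascoe_guidelines(fn_rate, fp_rate):
    """Check the limits recommended by Glascoe (1991): FN < 20%, FP < 10%."""
    return fn_rate < 0.20 and fp_rate < 0.10


# Hypothetical outcome counts for 200 screened preschoolers.
fn_rate, fp_rate = screening_error_rates(
    true_pos=34, false_neg=6, true_neg=148, false_pos=12
)
print(f"false-negative rate: {fn_rate:.3f}")
print(f"false-positive rate: {fp_rate:.3f}")
print("meets guidelines:", meets_glascoe_guidelines(fn_rate, fp_rate))
```

In this made-up tally, 6 of 40 at-risk children pass the test and 12 of 160 normal children fail it, so both rates fall under the recommended limits.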

Screening tests are by definition brief and there- 
fore prone to measurement errors. Even with the best 
tests available, deserving children will slip through 
the cracks (false-negative cases) and go unrecog- 
nized as needing intervention until they are well into 
their elementary years in school. However, if it iden- 
tifies a high proportion of true-positive cases— 
preschoolers classified at-risk who turn out really to 
need special services or delayed school entrance— 
a screening test still serves a useful purpose. 


Instruments for Preschool Screening 


Although dozens of instruments have been pro- 
duced to screen for developmental delays (Mal- 
colm, 1998), we limit discussion here to just three 
tests: the DIAL-III (Developmental Indicators for 
the Assessment of Learning-III), the Denver II (a 
revision of the Denver Developmental Screening 
Test-Revised), and the HOME (Home Observation 
for the Measurement of the Environment). The first 
two tests use conventional approaches for the iden- 
tification of developmental delay, whereas the third 
instrument, the HOME, embodies a radical depar- 
ture from traditional procedures. 


DIAL-III 


The Developmental Indicators for the Assessment 
of Learning-III is an individually administered 
screening procedure designed for the quick and ef- 
ficient detection of developmental problems (or gift- 
edness) in preschool children ages 3:0 through 6:11 
(Mardell-Czudnowski & Goldenberg, 1998). The
test kit includes materials and normative data for 
both English-speaking and Spanish-speaking chil- 
dren. The test screens the performance of children in 
three developmental domains: motor, concepts, and 
language. Items in these domains are administered 
directly to the child by the examiner. In addition, 
standardized scores are also obtained in Self-Help 
and Social Development by means of a parent ques- 
tionnaire. Examples of test items within the three de- 
velopmental domains include the following: 


Motor: Fine-motor items include block build- 
ing, cutting, copying shapes and letters, 
name writing, and finger touching; gross- 
motor items include catching, jumping, hop- 
ping, and skipping. 

Concepts: Pointing to named body parts, nam- 
ing or identifying colors, rote counting, 
counting blocks, positioning blocks, identi- 
fying concepts, and sorting shapes. 

Language: Giving personal information (name, 
age, sex), naming objects and actions, proper 
articulation, and phonemic awareness (e.g., 
rhyming). 


Scoring for some items is discrete and objective, 
whereas for other questions the scoring criteria in 
the manual leave room for subjective interpretation, 
which detracts from the reliability of the instrument. 
A total score is obtained by summing the three area 
scores. For each area score and also the total score, 
the manual provides cutoff scores for assigning the 
child to one of two outcome groups labeled “poten- 
tial delay” and “okay.” The standardization samples 
consisted of 1,560 children (English-speaking) and 
605 children (Spanish-speaking) stratified roughly 
by 1994 census data for gender, race, geographic re- 
gion, and parental education. 

Reliability of the DIAL-III is fair, given that it
is a brief test for screening purposes. Internal consistency
coefficients range from .66 for Motor to
.84 for Concepts, with a total scale reliability of .87.
Test-retest data are similar, which is to say, not up 
to the suggested minimum reliability of .90 for tests 
used to make individual decisions (Nunnally & 
Bernstein, 1994). Validity of the instrument has 
been evaluated along the familiar lines of content, 
construct, and criterion-related. Content validity is 
judged to be high insofar as a panel of experts
provided content reviews and helped eliminate in- 
appropriate and biased items. Criterion-related 
validity is strong, as judged by correlations with 
similar instruments such as the Early Screening 
Profiles, Differential Abilities Scale, and Peabody 
Picture Vocabulary Test-III. 

It is with regard to practical utility that the 
DIAL-III and its precursor editions have raised the 
greatest skepticism. The value of a screening test is 
best judged by the extent to which it accurately 
identifies children in need of further developmen- 
tal assessment. One useful statistic is sensitivity, 
which is the proportion of confirmed problem cases 
accurately “flagged” as problem cases by a test 
(i.e., children with delay who are accurately classi- 
fied as “potential delay”). Unfortunately, brief 
screening tests such as the DIAL-III do not reveal 
strong sensitivity when the recommended cutoff 
scores are used to identify children as showing “po- 
tential delay.” The only way to achieve high sensi- 
tivity is to liberalize the cutoff scores, that is, 
classify a larger proportion of children as showing 
“potential delay.” However, this will create prob- 
lems with specificity, which is the percentage of 
normal children correctly identified as normal. In- 
variably, the higher the sensitivity of a screening 
test, the lower its specificity. The cost of accurately 
identifying a high proportion of children with delay 
is that many normal children also will be tentatively 
labeled as “potential delay.” 
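
The trade-off between sensitivity and specificity can be illustrated with a small Python sketch. The scores, the confirmed-delay labels, and the cutoffs below are invented for illustration only (they are not DIAL-III data), and the sketch assumes that lower scores indicate poorer performance, so a child scoring at or below the cutoff is classified as "potential delay."

```python
def sensitivity_specificity(scores, has_delay, cutoff):
    """Classify scores at or below `cutoff` as "potential delay" and
    return (sensitivity, specificity) against the confirmed status."""
    tp = fn = tn = fp = 0
    for score, delayed in zip(scores, has_delay):
        flagged = score <= cutoff
        if delayed and flagged:
            tp += 1          # delay correctly flagged
        elif delayed:
            fn += 1          # delay missed
        elif flagged:
            fp += 1          # normal child flagged
        else:
            tn += 1          # normal child correctly passed
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical total scores for 12 children; True means confirmed delay.
scores = [8, 10, 11, 13, 15, 12, 14, 16, 17, 18, 19, 20]
has_delay = [True] * 5 + [False] * 7

for cutoff in (11, 13, 15):
    sens, spec = sensitivity_specificity(scores, has_delay, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up scores, raising the cutoff from 11 to 15 moves sensitivity from 3 of 5 to 5 of 5 delayed children, while specificity drops from 7 of 7 to 5 of 7 normal children.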


Denver II 


The Denver II (Frankenburg, Dodds, Archer, and 
others, 1990) is an updated version of the highly 
popular Denver Developmental Screening Test- 
Revised (Frankenburg, 1985; Frankenburg & 
Dodds, 1967). The Denver test is probably the 


CHAPTER 10 SPECIAL SETTINGS FOR PSYCHOLOGICAL ASSESSMENT


most widely known and researched pediatric 
screening tool in the United States. The instrument
is popular worldwide; it has been translated
into 44 different languages. Suitable for infants 
and children ages 1 month to 6 years, the test con- 
sists of 125 items in four areas: personal-social, 
fine motor-adaptive, language, and gross motor.
The items are a mix of parent report, direct elici- 
tation, and observation. Each item is arranged 
chronologically on the test by age of the child 
and marked pass/fail. Testing begins at an age- 
appropriate level and continues until the child 
fails three items. Total time for evaluation is 20 
minutes or less. 

Unlike other screening tests, the Denver II does 
not produce a developmental quotient or score. In- 
stead, results on about 30 age-appropriate items 
provide a score that can be interpreted as normal, 
questionable, or abnormal in reference to age-based 
norms. A category of “untestable” also is included. 
The standardization sample consisted of 2,096 chil- 
dren, all from the state of Colorado, stratified by 
age, race, and socioeconomic status.

Recognizing that no instrument is perfectly reliable
and that children change over time, the developers
of the Denver II recommend repeat testing
at approximately six-month intervals up to age 2 
years, and then yearly thereafter through age 5 
years. Reliability of the Denver II is reported to be 
outstanding for a brief screening test. Interrater re- 
liability among trained raters averaged an out- 
standing .99. Test-retest reliability for total score 
over a 7- to 10-day interval averaged .90. 

The Denver possesses excellent content valid- 
ity insofar as the behaviors tested are recognized 
by authorities in child development as important 
markers of development. However, the test interpretation
categories (normal, questionable, abnormal)
were based upon clinical judgment and
therefore await additional study for validation. A 
few initial studies raise significant concerns. Glas- 
coe and Byrne (1993) evaluated 89 children in day 
care settings who were 7 to 70 months of age. 
Based upon extensive independent evaluation, 18 
of these 89 children were confirmed to have de- 
velopmental delays according to federal defini- 


tions of disabling conditions (e.g., language de- 
lays, mental retardation, and autism). While the 
Denver II functioned well in correctly identifying 
15 of the 18 at-risk children, the instrument per- 
formed poorly with the normal children. In fact, 
38 of the 71 normal children failed the test and 
were classified as questionable or abnormal. Over- 
all, almost 4 in 6 children taking the test would 
be referred for additional assessment, and of 
the 4, only 1 would have a true disability. The re- 
searchers conclude: 


This causes parents needless expense and anxiety, 
wastes precious diagnostic and intervention re- 
sources, and leaves professionals with unanswered 
questions about children’s developmental status 
both before and after screening. 


They recommend further validational study with 
recalibration and possible discarding of some test 
items before the test receives widespread use. 
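
The counts reported for the Glascoe and Byrne (1993) sample can be turned into the usual screening statistics with a few lines of Python. The variable names are illustrative, and "positive predictive value" is a standard label supplied here rather than a term used by the researchers in the passage above.

```python
# Counts from the Glascoe and Byrne (1993) study as summarized above.
delayed_total = 18     # children with confirmed developmental delay
delayed_flagged = 15   # of those, classified questionable or abnormal
normal_total = 71      # children without delay
normal_flagged = 38    # of those, classified questionable or abnormal

sensitivity = delayed_flagged / delayed_total
specificity = (normal_total - normal_flagged) / normal_total
referred = delayed_flagged + normal_flagged
referral_rate = referred / (delayed_total + normal_total)
positive_predictive_value = delayed_flagged / referred

print(f"sensitivity: {sensitivity:.2f}")
print(f"specificity: {specificity:.2f}")
print(f"referred: {referred} of {delayed_total + normal_total} "
      f"({referral_rate:.2f})")
print(f"referred children with a true delay: {positive_predictive_value:.2f}")
```

The arithmetic gives a sensitivity of about .83 but a specificity of only about .46. Fifty-three of the 89 children (about 60 percent) would be referred, and only about 28 percent of those referred have a confirmed delay, which is the pattern the researchers summarize in rounder terms.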


HOME 


The Home Observation for Measurement of the 
Environment (HOME), popularly known as the 
HOME Inventory, is probably the most widely used 
index of children’s environment. Based upon in- 
home observation and an interview with the pri- 
mary caretaker, the instrument provides a measure 
of children’s physical and social environments. The 
HOME Inventory comes in three forms: Infant and 
Toddler, Early Childhood, and Middle Childhood. 
The latest editions of the instrument, dated 1984, 
emerged after 15 years of methodical revision and 
refinement (Caldwell & Richmond, 1967; Caldwell 
& Bradley, 1984, 1994).


Background and Description 


Prior to the development of the HOME Inventory, 
the measurement of children’s environments was 
based largely upon demographic data such as 
parental education, occupation, income, and loca- 
tion of residence. Often, these indices were com- 
bined into a cumulative measure referred to as 
social class or socioeconomic status (SES). For ex- 


ample, Hollingshead and Redlich (1958) developed 
a continuum of social class derived from residence, 
occupation, and education of the head of the house- 
hold. The SES score for a family whose household 
head worked at a clerical job, was a high school 
graduate, and lived in a middle-rank residential 
area would be computed as follows (Hollingshead 
& Redlich, 1958): 


Factor        Scale Value  x  Factor Weight  =  Partial Score
Residence          3               6                 18
Occupation         4               9                 36
Education          4               5                 20
Index of Socioeconomic Status                      = 74
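
The same computation can be written as a short Python sketch. The scale values and factor weights are those of the worked example; the dictionary layout and the function name `ses_index` are illustrative choices, not part of the Hollingshead and Redlich procedure.

```python
# (scale value, factor weight) pairs taken from the worked example.
factors = {
    "Residence": (3, 6),
    "Occupation": (4, 9),
    "Education": (4, 5),
}


def ses_index(factors):
    """Sum of scale value times factor weight across all factors."""
    return sum(value * weight for value, weight in factors.values())


# Partial score for each factor, then the overall index.
partial_scores = {name: value * weight
                  for name, (value, weight) in factors.items()}
print(partial_scores)
print("Index of Socioeconomic Status =", ses_index(factors))
```

The partial scores are 18, 36, and 20, which sum to the index of 74 shown in the example.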


For research purposes, social scientists may cate- 
gorize families into a fivefold hierarchy of social 
classes (classes I through V) based upon the total 
score. The reader will notice that the Hollingshead 
and Redlich measure was derived entirely from sta- 
tus indices. The unstated assumption is that these 
indices reflect, indirectly, meaningful environmen- 
tal variation. Put bluntly, proponents of SES as an 
environmental measure believe that, on average, 
children from a higher social class will experience 
a richer and more nurturant environment than chil- 
dren from a lower social class. 

In contrast to the SES approach, the HOME In- 
ventory was developed to provide a direct process 
measure of children’s environments. The guiding 
philosophy of this instrument is that direct assess- 
ment of children’s experiences is a better index of 
the home environment than such indirect measures 
as parental occupation and education. Although it 
is true that social class—as embodied in occupa- 
tion, education, residence—provides an oblique 
measure of environmental richness, the authors of 
the HOME Inventory would argue that direct as- 
sessment of children’s experiences provides a more 
accurate index of variations in the home environ- 
ment. Thus, assessment with the HOME involves, 
in part, direct observation of children’s home en- 
vironments to determine whether certain types of 
crucial interactions and experiences are present or 
absent. For example, during an hour-long visit, the
examiner observes whether the parent spontaneously
communicates with the child at least five
times, determines whether the child has at least 10 
children’s books or story records, and assesses 
whether the neighborhood is aesthetically pleas- 
ing according to detailed standards, to cite just a 
few examples. 

The purpose of the HOME Inventory is to mea- 
sure the quality and quantity of stimulation and 
support for cognitive, social, and emotional devel- 
opment available to the child in the home. The 
scales and items of the HOME were derived from a 
list of environmental processes identified from ex- 
isting research and theory as important for optimal 
childhood development (Caldwell & Bradley, 
1984). These growth-promoting processes include 
basic need gratification; frequent contact with a rel- 
atively small number of adults; a positive emotional 
climate that fosters trust of self and others; appro- 
priate, varied, and patterned sensory input; con- 
sistency in the physical, verbal, and emotional 
responses of others; a minimum of social restric- 
tions on exploratory and motor behavior; structure 
and order in the daily environment; provision and 
adult interpretation of varied cultural experiences; 
appropriate play materials and environment; contact 
with adults who value achievement; and the cumu- 
lative programming of experiences to match the 
child’s developmental level (Caldwell & Bradley, 
1984). In brief, then, the purpose of the HOME is to 
measure specific, designated patterns of nurturance 
and stimulation available to children in the home. 

In order to complete the HOME Inventory, the 
examiner must observe the child and caregiver 
(usually the mother) interacting in the home envi- 
ronment. Ratings for a few inventory items are 
derived from observation of the physical environ- 
ment. In addition, completion of some items is 
based upon self-report of the caregiver. Items are 
dichotomously scored, 1 for present, 0 for absent. 
For example, one item asks whether the child is in- 
cluded in grocery store shopping at least once a 
week. The manual for the inventory encourages a 
relaxed, semistructured approach to observation 
and interview (Caldwell & Bradley, 1984). Completion
of the inventory takes about an hour.

The three forms of the HOME are Infant and
Toddler (ages 0 to 3 years), Early Childhood (ages 3 
to 6 years), and Middle Childhood (ages 6 to 10 
years). The Infant and Toddler form consists of 45 
items organized into the following six subscales: 


Emotional and Verbal Responsivity of Parent 
Acceptance of the Child’s Behavior 
Organization of the Environment 

Provision of Appropriate Play Materials 
Parent Involvement with Child 

Variety of Stimulation 


The Early Childhood version consists of 55 items 
organized into eight subscales, whereas the Middle 
Childhood version consists of 59 items organized 
into eight subscales. The items for the Infant and 
Toddler version of the HOME Inventory are listed 
in Table 10.1. Details on the specific items included 
in the HOME can be found in Caldwell and 
Bradley (1984). 


Technical Features 


Relevant norms for the HOME Inventory are avail- 
able from several sources. For the Infant and Tod- 
dler version, Caldwell and Bradley (1984) report 
subscale means and standard deviations for 174 
families from Little Rock, Arkansas. Compared to 
the general population, this sample appears to 
overrepresent lower-SES families. For example, 34 
percent of the families were on welfare, and 29 
percent were single-parent households. For the 
Early Childhood version, standardization data 
were available from 232 families in Little Rock, 
with lower-SES families similarly overrepre- 
sented. For the Middle Childhood version, Bradley 
and Rock (1985) report subscale means and stan- 
dard deviations for 141 families from Little Rock. 
Approximately half of these families were African 
American, the remainder Caucasian; boys and girls
were sampled equally. These families were thought 
to be representative of all families rearing elemen- 
tary-aged children in Little Rock, Arkansas. How- 
ever, for all three versions it is clear that the 
standardization samples provide only local norms. 


These data may be useful as points of reference but 
should not be equated with a stratified, random, na- 
tional sample. 

The reliability of the HOME Inventory has 
been demonstrated in a variety of ways, particularly 
for the Infant and Toddler version, which we dis- 
cuss here. The authors note that short-term test- 
retest studies are inappropriate, since a respondent 
is quite likely to remember a specific answer given 
to a question, which would artificially inflate test- 
retest correlations (Bradley & Caldwell, 1984). 
Methods used for the assessment of reliability 
included interobserver agreement, internal consis- 
tency, and long-range test-retest stability coeffi- 
cients for 91 families from the standardization 
sample. By definition, interobserver agreement for 
the subscale items is reported to be 90 percent or 
higher, since this is the training criterion for 
new raters. Internal consistency estimates using 
Kuder-Richardson formula 20 ranged from .67 to 
.89 for all subscales except Variety of Stimulation, 
which yielded a coefficient of only .44. This rather 
low reliability coefficient was due to the small 
number of items in the subscale (five). Test-retest 
data were available from 91 families tested when 
their infant/toddler was 6, 12, and 24 months of 
age. The coefficients indicated a moderate to high 
degree of stability for the subscales, with most cor- 
relations in the .50s, .60s, and .70s. The correlation 
between total score for testings at 12 and 24 months 
of age was a highly respectable .77. 
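
Because HOME items are scored dichotomously (1 for present, 0 for absent), internal consistency can be estimated with Kuder-Richardson formula 20: KR-20 = [k / (k - 1)] x [1 - (sum of p x q) / variance of total scores], where k is the number of items, p is the proportion of respondents scoring 1 on an item, and q = 1 - p. The following Python sketch applies the formula to an invented five-item subscale. The response matrix is hypothetical, not HOME data, and the sketch uses the population variance of total scores.

```python
from statistics import pvariance


def kr20(item_matrix):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.

    item_matrix: one row per respondent, one 0/1 entry per item.
    """
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    total_variance = pvariance(totals)  # population variance of total scores
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion scoring 1
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


# Hypothetical five-item subscale scored for six families.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(f"KR-20 = {kr20(responses):.2f}")
```

These invented responses form a perfectly ordered pattern, so the coefficient is fairly high even with five items. With real data, a short subscale whose items are only weakly related will tend to yield a low coefficient, as reported for Variety of Stimulation.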

The validity of the HOME Inventory has been 
bolstered by research findings that show modest 
correlations with SES indices. Because the inven- 
tory was proposed as a more meaningful, sensitive 
index of environment than social class, HOME 
scores should be significantly but not highly related 
to SES indices. For the Infant and Toddler version, 
HOME Inventory subscale correlations with SES 
are mainly in the .30s and .40s, while the total 
score-SES correlation is .45 (Bradley, Rock, Cald- 
well, & Brisby, 1989). HOME scores also revealed 
a strong relationship with poverty status in Cau- 
casian and minority samples (Bradley, Corwyn, 
Pipes McAdoo, & Garcia Coll, 2001). Further, 
higher HOME scores predicted that children would
exhibit fewer behavior problems and better preschool
ability in a study of 93 single African American
mothers (Jackson, Brooks-Gunn, Huang, &
Glassman, 2000).


TABLE 10.1
List of Subscales and Items for the Infant and Toddler Version of the HOME Inventory

HOME Inventory

Place a plus (+) or minus (-) in the box alongside each item if the behavior is observed during the visit or if the parent
reports that the conditions or events are characteristic of the home environment. Enter the subtotal and the total
on the front side of the Record Sheet.


I. Emotional and Verbal RESPONSIVITY

[ ] 1. Parent spontaneously vocalized to child twice.
[ ] 2. Parent responds verbally to child’s verbalizations.
[ ] 3. Parent tells child name of object or person during visit.
[ ] 4. Parent’s speech is distinct and audible.
[ ] 5. Parent initiates verbal exchanges with visitor.
[ ] 6. Parent converses freely and easily.
[ ] 7. Parent permits child to engage in “messy” play.
[ ] 8. Parent spontaneously praises child at least twice.
[ ] 9. Parent’s voice conveys positive feelings toward child.
[ ] 10. Parent caresses or kisses child at least once.
[ ] 11. Parent responds positively to praise of child offered by visitor.
[ ] Subtotal


II. ACCEPTANCE of Child’s Behavior

[ ] 12. Parent does not shout at child.
[ ] 13. Parent does not express annoyance with or hostility to child.
[ ] 14. Parent neither slaps nor spanks child during visit.
[ ] 15. No more than one instance of physical punishment during past week.
[ ] 16. Parent does not scold or criticize child during the visit.
[ ] 17. Parent does not interfere or restrict child more than 3 times.
[ ] 18. At least ten books are present and visible.
[ ] 19. Family has a pet.
[ ] Subtotal


III. ORGANIZATION of Environment

[ ] 20. Substitute care is provided by one of three regular substitutes.
[ ] 21. Child is taken to grocery store at least once/week.
[ ] 22. Child gets out of house at least four times/week.
[ ] 23. Child is taken regularly to doctor’s office or clinic.
[ ] 24. Child has a special place for toys and treasures.
[ ] 25. Child’s play environment is safe.
[ ] Subtotal


IV. Provision of PLAY MATERIALS

[ ] 26. Muscle activity toys or equipment.
[ ] 27. Push or pull toy.
[ ] 28. Stroller or walker, kiddie car, scooter, or tricycle.
[ ] 29. Parent provides toys for child during visit.
[ ] 30. Learning equipment appropriate to age: cuddly toys or role-playing toys.
[ ] 31. Learning facilitators: mobile, table and chairs, high chair, play pen.
[ ] 32. Simple eye-hand coordination toys.
[ ] 33. Complex eye-hand coordination toys (those permitting combination).
[ ] 34. Toys for literature and music.
[ ] Subtotal


V. Parental INVOLVEMENT with Child

[ ] 35. Parent keeps child in visual range, looks at often.
[ ] 36. Parent talks to child while doing household work.
[ ] 37. Parent consciously encourages developmental advance.
[ ] 38. Parent invests maturing toys with value via personal attention.
[ ] 39. Parent structures child’s play periods.
[ ] 40. Parent provides toys that challenge child to develop new skills.
[ ] Subtotal


VI. Opportunities for VARIETY

[ ] 41. Father provides some care daily.
[ ] 42. Parent reads stories to child at least 3 times weekly.
[ ] 43. Child eats at least one meal per day with mother and father.
[ ] 44. Family visits relatives or receives visits once a month or so.
[ ] 45. Child has 3 or more books of his/her own.
[ ] Subtotal


[ ] TOTAL SCORE

For complete wording of items, please refer to the Administration Manual.

Source: Reprinted with permission from Caldwell, B. M., & Bradley, R. H. (1984). Home Observation for Measurement of the Environment.
Little Rock: University of Arkansas at Little Rock.

HOME scores also show strong, theory-con- 
firming relationships with appropriate external 
criteria, including language and cognitive develop- 
ment, school failure, therapeutic intervention, and 
mental retardation (Caldwell & Bradley, 1984). The 
correlations between HOME scores and intellectual 
measures such as the Bayley Scales of Infant De- 
velopment and the Stanford-Binet are particularly 
informative. Bradley and Caldwell (1984) con- 
ducted a longitudinal study with 174 families, ad- 
ministering the HOME at 6, 12, and 24 months of 
age and correlating these scores with the Bayley 
Mental Development Index (MDI) at 12 months 
and with Stanford-Binet IQ at 36 months and 54 
months of age. The pattern of correlations indicated 
stronger predictive than concurrent validity; that is, 
HOME-IQ correlations at 36 months and 54 
months of age were higher than HOME-MDI correlations
at 12 months of age (Table 10.2). Factor-analytic
studies of the HOME also support the construct
validity of this instrument (Bradley, Mundfrom,
Whiteside, and others, 1994; Mundfrom,
Bradley, & Whiteside, 1993).


TABLE 10.2 Correlations between HOME Scores
at 12 Months and Cognitive Scores at 12, 36, and
54 Months

HOME             MDI at       IQ at        IQ at
Subscale         12 Months    36 Months    54 Months

Responsivity     .15          .39*         .34*
Restriction      .01          .24*         .21
Organization     .20          .39*         .34*
Play Materials   .28*         .56*         .52*
Involvement      .28*         .47*         .36*
Variety          .05          .28*         [illegible]
Total Score      .30*         .58*         [illegible]

*p < .05


Note: MDI refers to the Mental Development Index from the Bayley
Scales of Infant Development; IQ is from the Stanford-Binet.


Source: Reprinted from Bradley, R. H., & Caldwell, B. M.
(1984). 174 children: A study of the relationship between home
environment and cognitive development during the first 5 years. In
A. W. Gottfried (Ed.), Home environment and early cognitive
development: Longitudinal research. Orlando, FL: Academic Press.
Copyright 1984. Reprinted with permission from Elsevier.

In addition to its usefulness as a research tool, 
the HOME shows promise as a clinical instrument. 
Because low HOME scores are predictive of risk 
for intellectual disability, the inventory can be used 
to identify children for whom remedial intervention 
would be appropriate. This kind of intervention is 
more than just a humanistic application of research 
findings—it may be required by law: 


Among the more compelling new reasons for in- 
vestigating the environments of handicapped chil- 
dren is the requirement in P. L. 99-457 that all 
service plans for preschool-age handicapped chil- 
dren include a component for parents. It will no 
longer be sufficient to have a plan of remediation 
aimed exclusively at the child. Plans for parental 
involvement and for developing the capacities of 
parents must also be included. (Bradley et al., 
1989) 


In sum, the HOME inventory shows promise not 
only in research, but also as a practical adjunct to 
intervention. 


INTELLECTUAL EVALUATION 
OF PRESCHOOL CHILDREN 


By definition, screening tests are less accurate than
comprehensive assessments; that is, they merely 
signal the need for further evaluation. As part of any 
follow-up evaluation, a practitioner would admin- 
ister a variety of instruments, almost certainly in- 
cluding an individual intelligence test. Although 
test scores of preschoolers on intelligence measures 
are notoriously unstable in the long run, in the short 
run there is probably no better index of whether a 
child is at-risk for school failure. 

Several individually administered intelligence 
tests are suitable for preschool children. The most 
widely used are the Stanford-Binet: Fourth Edition
(SB:FE), the Wechsler Preschool and Primary
Scale of Intelligence-Revised (WPPSI-R), the Kaufman Assessment
Battery for Children (K-ABC), and the


Differential Ability Scales (DAS). These instru- 
ments were reviewed in an earlier topic. 


ASSESSMENT OF LEARNING 
DISABILITIES AND RELATED 
DISORDERS 


The learning disability (LD) field is one of the fastest
growing areas within assessment. Paradoxically, 
it is also one of the most controversial and per- 
plexing domains of psychological testing. Consid- 
erable background is needed to understand the role 
of psychological tests in the evaluation of learning 
disabilities. We begin by asking a seemingly sim- 
ple question that turns out to have a complicated 
answer: What is a learning disability? 


The Federal Definition of Learning Disabilities 


For decades the essential nature of learning dis- 
abilities has been understood in terms of a defini- 
tion embedded in federal law. In 1975, Congress 
passed Public Law 94-142, the Education for All 
Handicapped Children Act. One of the provisions 
of this act was a definition of learning disabilities 
as follows: 


The term “specific learning disability” means a 
disorder in one or more of the basic psychological 
processes involved in understanding or in using 
language, spoken or written, which may manifest 
itself in imperfect ability to listen, speak, read, 
write, spell, or to do mathematical calculations. 
The term includes such conditions as perceptual 
handicaps, brain injury, minimal brain dysfunc- 
tion, dyslexia, and developmental aphasia. The 
term does not include children who have learning 
disabilities which are primarily the result of 
visual, hearing, or motor handicaps, of mental re- 
tardation, or emotional disturbance, or of environ- 
mental, cultural, or economic disadvantage. 
(USDE, 1977, p. 65083) 


The commitment to a federally mandated definition 
was reaffirmed in 1990 by passage of Public Law 
101-476, the Individuals with Disabilities Education
Act (IDEA). Slightly more than half of the
states in the United States now follow this model.
The remaining states mandate similar approaches
(Lerner, 1993). 

The federal definition embodied in IDEA also 
stipulates an operational approach to the identi- 
fication of children with learning disabilities. 
Specifically, candidates for an LD diagnosis must 
demonstrate a severe discrepancy between general 
ability (intelligence) and specific achievement in 
one or more of these seven areas: 


Oral expression 
Listening comprehension 
Written expression
Basic reading skill
Reading comprehension 
Mathematics calculation 
Mathematics reasoning 


The discrepancy model for the identification of LD 
children has functioned as a directive for school 
psychologists. In effect, the model mandates that 
psychologists should administer an individual in- 
telligence test (general ability measure) and an in- 
dividual achievement test (specific achievement 
measure) and then look for a discrepancy between 
Full Scale IQ and one or more areas of school 
achievement (e.g., reading, mathematics, written 
expression). 

In practical terms, a severe discrepancy has been 
defined as a difference of one standard deviation or 
more between general intelligence and specific 
achievement. A common practice in identification 
of LD children is to compare Full Scale IQ on an in- 
dividual intelligence test such as the WISC-III with 
specific achievement scores on an individual
achievement test such as the WIAT (Wechsler Indi- 
vidual Achievement Test) or similar instrument that 
has subtests normed with a mean of 100 and a stan- 
dard deviation of 15. A difference of 15 points or 
more between Full Scale IQ and specific achieve- 
ment in any of the previously listed areas would 
then raise the suspicion of learning disability. 
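
The 15-point rule can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a published scoring procedure: the function name, the area labels, and the sample scores are hypothetical, and both tests are assumed to report standard scores with a mean of 100 and a standard deviation of 15. The function lists the achievement areas that fall at least one standard deviation below Full Scale IQ.

```python
SD = 15  # assumed standard deviation of both score scales (mean of 100)


def severe_discrepancies(full_scale_iq, achievement, threshold_sd=1.0):
    """Return {area: gap} for areas at least threshold_sd SDs below IQ.

    Hypothetical helper for illustration only.
    """
    cutoff = threshold_sd * SD
    return {
        area: full_scale_iq - score
        for area, score in achievement.items()
        if full_scale_iq - score >= cutoff
    }


# Invented scores for a hypothetical child.
scores = {
    "basic reading skill": 82,
    "reading comprehension": 88,
    "mathematics calculation": 101,
}
flagged = severe_discrepancies(104, scores)
print(flagged)  # {'basic reading skill': 22, 'reading comprehension': 16}
```

A flagged area would do no more than raise the suspicion of a learning disability; it is not a diagnosis in itself.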

Unfortunately, the federal definition has not 
served its intended purposes, and, increasingly, 
school psychologists and other professionals look to 
other approaches for understanding and assessing 
learning disabilities in children. The fundamental 
problem is that many children who exhibit serious
learning problems in school and who would
benefit from services for LD simply do not meet 
the psychometric criteria of a severe discrepancy. 
This is because a learning disability may adversely 
affect performance on both the intelligence and 
the achievement measures used to diagnose it, re- 
sulting in a test profile that does not fit the discrep- 
ancy model but nonetheless is LD (Shaw, Cullen, 
McGuire, & Brinckerhoff, 1995). Another problem 
is that individual states have adopted different dis- 
crepancy formulas, such that a child is viewed as 
having an LD in one location, but not another. Not 
only does this create confusion, it undermines the 
integrity of the entire enterprise of LD identifica- 
tion. Finally, an additional problem is that the preva- 
lence of a severe discrepancy fluctuates wildly as a 
function of which tests are used. Consider a study 
by Schultz (1997) in which 62 at-risk fifth graders 
were evaluated for LD first with the WISC-R and 
shortly thereafter with the WISC-III. In both cases 
the usual 15-point discrepancy (between Wechsler 
IQs and standardized achievement scores) was used 
to operationalize the classification of students as 
LD. Whereas 86 percent of the children met the cri- 
teria for LD with the WISC-R, only 48 percent qual- 
ified with the WISC-III a few months later. 


Emerging Consensus: A New 
Definition of Learning Disabilities 


After a lengthy period of confusion and struggle 
over the definition of learning disabilities, special- 
ists and educators began to rally around a consen- 
sus view in the early 1990s. The new definition was 
proposed by the National Joint Committee on 
Learning Disabilities (NJCLD), a group of repre- 
sentatives from eight national organizations with a 
special interest in learning disabilities. Although 
similar to the federal definition, the new approach 
contains important contrasts: 


Learning disabilities is a general term that refers 
to a heterogeneous group of disorders manifested 
by significant difficulties in the acquisition and 
use of listening, speaking, reading, writing, rea- 
soning, or mathematical abilities. These disorders 
are intrinsic to the individual, presumed to be
due to central nervous system dysfunction, and 
may occur across the life span. Problems in self- 
regulatory behaviors, social perception and social 
interaction may exist with learning disabilities but 
do not by themselves constitute a learning disabil- 
ity. Although learning disabilities may occur con- 
comitantly with other handicapping conditions (for 
example, sensory impairment, mental retardation 
[MR], serious emotional disturbance [ED]) or with 
extrinsic influences (such as cultural differences, 
insufficient or inappropriate instruction), they are 
not the result of those conditions or influences. 
(NJCLD, 1988, p. 1) 


The new definition avoids vague reference to “basic 
psychological processes,” specifies that the disor- 
der is intrinsic to the individual, identifies central 
nervous system dysfunction as the origin of LD 
problems, and states explicitly that learning dis- 
abilities may extend into adulthood. 

Perhaps most important of all, the NJCLD ap- 
proach abandons the excessive reliance upon dis- 
crepancy between ability and achievement as the 
hallmark of LD. Instead, the new model specifies 
that the necessary (but not sufficient) condition of 
LD is that the individual (child or adult) exhibit an 
intraindividual weakness in one or more of the core 
areas of academic functioning (listening, speaking, 
reading, writing, reasoning, or mathematical abili- 
ties). Shaw et al. (1995) illustrate how the NJCLD 
model might look in practice (Figure 10.1). In this 
approach, the first task is to identify one or more in- 
traindividual weaknesses in the core areas. These 
are always relative to strengths in several other core 
areas. In other words, persons who are slow learn- 
ers in all areas do not meet the criteria of LD. The 
second step is to trace the learning difficulties to 
central nervous system dysfunction, which may 
manifest as problems with information processing. 
For example, a young adult with a severe weakness 
in listening (as judged by her inability to learn from 
the traditional lecture approach to teaching) might 
exhibit a deficit on a test of verbal memory— 
confirming that an information-processing problem 
was at the heart of her disability. The purpose of 
the third step (examining psychosocial skills, physical
and sensory abilities) is to specify additional
problems that may need to be addressed for
program-planning purposes. Finally, in the fourth
step the examiner rules out non-LD explanations for
the learning difficulties (since these explanations
would mandate a different strategy for remediation).


Step 1. Intraindividual Discrepancy

The examiner identifies a significant difficulty in one
or more core areas alongside relative strengths in several
areas. Core Areas: Listening, Speaking, Reading,
Writing, Reasoning, Math, Subject Area.


Step 2. Discrepancy Intrinsic to the Individual

The examiner traces the discrepancy to central nervous
system dysfunction (e.g., brain injury) or links the
discrepancy with information-processing problems (e.g.,
memory, organization, or learning efficiency).


Step 3. Related Considerations

The examiner evaluates the relevance of psychosocial
skills, physical abilities, and sensory abilities to the
learning disability.


Step 4. Alternative Explanations

The examiner rules out alternative explanations, e.g.,
environmental, cultural, or economic factors; or
inappropriate or inadequate instruction.


Step 5. LD Diagnosis

The examiner determines that children who pass steps
1 through 4 meet the criteria for an LD diagnosis.


FIGURE 10.1 Operationalizing the NJCLD Definition of
Learning Disability

Source: Based on Brinckerhoff, L., Shaw, S., & McGuire, J. (1993).
Promoting postsecondary education for students with learning
disabilities: A handbook for practitioners. Austin, TX: PRO-ED.
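
The five steps of Figure 10.1 can be read as a sequential decision procedure. The Python sketch below is one hedged way to express that reading; every field name, the function name, and the choice of two strengths as the minimum for "several" are assumptions made for illustration, and each flag stands in for a clinical judgment that no program could make.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Examiner judgments for one person; all field names are hypothetical."""
    weak_core_areas: list          # Step 1: core areas with significant difficulty
    strong_core_areas: list        # Step 1: core areas of relative strength
    intrinsic_basis: bool          # Step 2: CNS dysfunction or processing problem found
    related_problems: list         # Step 3: psychosocial, physical, or sensory concerns
    alternative_explanation: bool  # Step 4: a non-LD explanation accounts for the problem


def njcld_screen(ev, min_strengths=2):
    """Walk the steps in order; return (meets_criteria, reason)."""
    # Step 1: at least one weakness alongside strengths in several other areas.
    # The threshold of two strengths is an assumption, not part of the model.
    if not ev.weak_core_areas or len(ev.strong_core_areas) < min_strengths:
        return False, "no intraindividual discrepancy"
    # Step 2: the discrepancy must be intrinsic to the individual.
    if not ev.intrinsic_basis:
        return False, "discrepancy not traced to the individual"
    # Step 3 excludes no one; it records issues for program planning.
    notes = ", ".join(ev.related_problems) or "none"
    # Step 4: non-LD explanations must be ruled out.
    if ev.alternative_explanation:
        return False, "alternative explanation not ruled out"
    # Step 5: the criteria are met.
    return True, "criteria met; related considerations: " + notes


# The young adult from the text who cannot learn from lectures.
lecture_student = Evaluation(
    weak_core_areas=["listening"],
    strong_core_areas=["reading", "writing", "math"],
    intrinsic_basis=True,  # e.g., a deficit on a test of verbal memory
    related_problems=[],
    alternative_explanation=False,
)
print(njcld_screen(lecture_student))
# (True, 'criteria met; related considerations: none')
```

A person who is slow in all areas fails at the first step, because no relative strengths are recorded.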

Hammill (1990) has noted wisely that political 
realities are such that the NJCLD definition may 
never replace the federally mandated approach. But 
this is less important than whether parents and pro- 
fessionals unite around one definition for purposes 
of research and communication with one another. 
At the present time, the NJCLD definition has re- 
ceived the strongest general support from profes- 
sionals in the field of assessment. 

Although we will not discuss additional view- 
points here, we can refer the reader to several other 
respected models of learning disabilities (Special 
Education Today, 1985; Lerner, 1993; Lyon, 1994).
Offered many years ago, the wise counsel of Farn- 
ham-Diggory (1978) is still worth mentioning in 
this context. Shortly after Public Law 94-142 was 
activated in 1977 with its influential definition of 
learning disabilities, she wrote: 


Publishing this definition has amounted to waving 
a red flag in front of a herd of bulls—parents and 
professionals alike. Far from clarifying the situa- 
tion, the definition inspired so much snorting and 
ground-pawing that the conceptual dust has grown 
thicker than ever. Part of the problem arises from 
the fact that we lose sight of what a definition is 
for. Definitions are not truth: they merely set up the
conditions under which particular actions are to be
taken. (Farnham-Diggory, 1978) 


The actions referred to may include research in- 
vestigations, diagnostic assessments, and/or educa- 
tional interventions. But whatever response is 
undertaken, we must guard against the proclivity 
to view definitions as true or false. Definitions are 
merely human inventions with greater or lesser util- 
ity, nothing more. 


Essential Features of Learning Disabilities 


Even though the definition of LD remains a point 
of contention, we can cite several features of these 
disorders that are less controversial. As the reader 
will discover, the features discussed in the follow- 
ing dictate, to some extent, the nature of testing 
practices in the assessment of learning disabilities. 
There is general agreement—with occasional dis- 
senting votes—on the following features of learn- 
ing disabilities: 


1. A learning disability involves an intraindividual 
discrepancy in cognitive functioning. The child 
(or adult) with LD reveals a relative weakness in 
one area compared to strengths in most other 
areas. According to the federal definition fol- 
lowed within many school systems, the discrep- 
ancy is between general ability (intelligence) 
and specific achievement. We have described 
previously some of the pitfalls of this definition 
and prefer the NJCLD approach in which the 
discrepancy is not rigidly tied to a difference between
IQ and achievement test scores.

2. An exclusionary clause is included in most definitions
of learning disability. If the academic difficulties
are primarily caused by other disabling
conditions (mental retardation, emotional distur- 
bance, visual or hearing impairment, cultural or 
social disadvantage), then a diagnosis of learning 
disability is typically ruled out. This clause is 
often misinterpreted. A person can be both learn- 
ing disabled and impaired in other ways (e.g., 
have mental retardation). The important point is 
that the coexisting condition must not be the pri- 
mary cause of the learning difficulties. 

3. Learning disabilities are heterogeneous; that is,
there are many different varieties. Research on 
the identification of subtypes is still in its in- 
fancy, but most researchers express optimism 
that meaningful subgroups of persons with 
learning disabilities can be identified. Pending 
further research and refinement, only two broad 
categories of learning disability are recognized
currently (Forster, 1994). These are as follows:

• Dyslexia or verbal learning disability
• Right hemisphere or nonverbal learning disability

The characteristics of these two major categories 
of LD are outlined in Table 10.3. These patterns 
have emerged in many studies of children with 
LD. For example, Blakely, Crinella, Fisher, Cham- 
paigne, and Beck (1994) distinguished both a ver- 
bal LD and a nonverbal LD in a sophisticated 
analysis of neuropsychological test scores for 177 
children 9 to 14 years of age. The sample consisted 
of 129 children with LD (including 37 with verified 
brain damage) and 48 children with no evidence 
of LD or brain damage. Six patterns of neuro- 
psychological performance were identified by 
means of a complex statistical clustering method: 


1. Very Low IQ: Children with very low IQ but
otherwise nondiscrepant test scores

2. Low IQ: Children with low IQ but otherwise
nondiscrepant test scores

3. Clumsy/Lethargic: Children whose test
scores indicate clumsiness and lethargy

4. Language Dysfunction: Children with relatively
low scores on language variables

5. Spatial Dysfunction: Children with good verbal
function but faulty spatial orientation

6. No Deficit: Children with no detectable
deficits


Groups 4 and 5 correspond to the major categories
of LD listed previously (verbal and nonverbal
LD), whereas the other groups signify
normalcy (group 6), low intellectual ability
(groups 1 and 2), or variant forms of possible
LD (group 3). Of course, some of the children
in groups 1 and 2 might meet the criteria for LD
when assessed with additional tests.


TABLE 10.3 Characteristics of Two Broad Categories of Learning Disability

                          Dyslexia or Verbal                      Right Hemisphere or Nonverbal
                          Learning Disability                     Learning Disability

Primary Manifestation     Unexpected difficulty in learning       Poor skills in mathematics,
                          to read or spell                        handwriting, or social cognition

Fundamental Deficiency    Problems in phonological coding         Problems in spatial cognition
                          (associating sounds with letter         (visuospatial perception of
                          combinations)                           relationships)

Physiological Correlates  Subtle anomalies in the left cerebral   Likely origin in right cerebral
                          hemisphere (revealed by brain scans     hemisphere dysfunction
                          and EEG studies)

Relative Prevalence       About 90% of all LD cases               About 10% of all LD cases

Ratio of Boys to Girls    3:1 or 4:1                              1:1

Source: Based on Forster, A. (1994). Learning disabilities. In R. J. Sternberg (Ed.), Encyclopedia of
human intelligence. New York: Macmillan.


4. A learning disability is a developmental phenomenon
that is usually evident in early childhood
and may persist into adulthood. Even
though remediation efforts should be based 
upon optimism—so as to avoid self-fulfilling 
prophecies—a dose of realism is needed, too. 
Longitudinal studies of children with severe 
learning disabilities suggest that marked im- 
provement in academic achievement is the ex- 
ception, not the rule, even when these subjects 
receive intensive educational intervention. For ex- 
ample, Frauenheim and Heckerl (1983) retested 
11 adults diagnosed as having learning disabili- 
ties in childhood. All the participants had received 
special help for reading; nine had graduated from 
high school, and two completed the tenth grade. 
Full Scale IQs were typically in the low 90s, 
with Verbal IQ below average (mean of 85) and 
Performance IQ above average (mean of 104). In 
spite of the remedial intervention, when retested 
as adults on exactly the same achievement test 
(Wide Range Achievement Test), these exami- 
nees were scarcely improved from their ele- 
mentary school results. These findings are 
corroborated by several other follow-up studies 
(see Kolb & Whishaw, 1990, chap. 29, for a re- 
view). Such results indicate that specialists who 
work with children with learning disabilities
should not become fixated solely upon aca- 
demic concerns. Social and emotional prob- 
lems—which may be more amenable to 
intervention—also cry out for notice. 

5. Individuals with learning disabilities frequently 
experience social and emotional difficulties that 
are as pervasive and consequential as the deficits 
in academic achievement. These problems may 
persist into adolescence and adulthood. In fact, 
the socioemotional sequelae often become the 
primary presenting complaint, which can com- 
plicate the testing process and obscure the diag- 
nosis. For example, in a needs assessment study 
of 381 adults with learning disabilities; Hoffman, 
Sheldon, Minskoff, and others (1987) identified 
several crucial nonacademic areas meriting in- 
tervention by service providers. These adults 
self-endorsed several social and emotional prob- 
lems with high frequency: feeling frustrated (40 
percent), talking or acting before thinking (33 
percent), being shy (31 percent), no self-confi- 
dence (28 percent), controlling emotions and 
temper (28 percent), and dating (27 percent). 
Many other problems were also endorsed, but by 
less than 25 percent of the sample. These find- 
ings indicate that learning disability assessments 
should incorporate measures of social and emo- 
tional functioning. Vaughn and Haager (1994) 
provide an excellent overview on the measure- 
ment of social skills in persons with learning 
disability. 


Causes and Correlates of Learning Disabilities 


Approximately 4 to 5 percent of all school-aged 
children receive a diagnosis of LD, so this is not a 
rare problem (Chalfant, 1989; Lyon, 1996). The 
most common form of LD is dyslexia, and boys out- 
number girls by about 3:2 (Nass, 1992). In a mi- 
nority of cases, the etiology is clear and can be 
attributed to a specific cause such as a known brain 
injury. The reader will recall from Chapter 9 that left 
hemisphere impairment is especially likely to result 
in verbal difficulties, whereas right hemisphere 
impairment may lead to problems with spatial 
thinking or other nonverbal skills. Thus, head injury
or other neurological problems can be the proximate 
cause of a child receiving an LD diagnosis. 
However, in the majority of cases the direct eti- 
ology of LD problems is unclear. A number of pos- 
sibilities have been proposed and these may explain 
some but not all cases of LD. For example, patho- 
logical neurodevelopmental processes have been 
identified in some persons with severe dyslexia (Cul- 
bertson & Edmonds, 1996). Individuals with this 
disorder appear to have alterations in brain structures 
such as the planum temporale (the flat surface on the 
top of the temporal lobes) known to be important for 
language processing. Whereas in normal individuals 
the planum temporale is much larger in the left tem- 
poral lobe than in the right, persons with severe 
dyslexia do not show this pattern of asymmetry 
(tending toward symmetry instead). Moreover, re- 
searchers have identified microscopic cortical mal- 
formations called polymicrogyria (numerous small 
convolutions) that parallel these structural differ- 
ences. Several postmortem studies of persons with 
severe dyslexia have revealed these deviations at the 
cellular level. Spreen (2001) provides an outstanding 
review of the possible neurological substrates of 
learning disabilities. Dyslexia also appears to show 
a significant genetic component for some persons 
such that the idea of familial dyslexia needs to be 
taken seriously. However, what must be emphasized 
is that for most individuals the etiology of LD 
(whether dyslexia or other forms) remains a mystery. 


Assessment of Learning Disabilities 


Learning disabilities manifest primarily as acade- 
mic problems; that is, a child with LD is typically 
unable to master skills important for school success 
such as reading, mathematics, or written commu- 
nication. Because school-based accomplishment is 
at the heart of the problem, an evaluation for LD 
must include relevant measures of academic achieve- 
ment. Furthermore, the evaluation of school achieve- 
ment—one small part of an LD assessment—must 
be based upon an individual test of achievement. 
Even though a group achievement test might raise 
the suspicion of a learning disability, practitioners 
must rely upon individual achievement tests for definitive
assessment. We explain why this is so and
then review useful instruments for achievement 
testing. 

Individual achievement tests typically are ad- 
ministered one on one with the examiner sitting 
across from the respondent and posing structured 
questions and problems. Of course, any well-stan- 
dardized achievement test will yield normative data 
about the functioning of a schoolchild. But the special 
virtue of individual achievement tests is that the ex- 
aminer can observe the clinical details of deficient (or 
superior) performance and form hypotheses about the 
cognitive capacities of the examinee. 

Consider the problem of poor spelling, widely 
observed in children and adults with verbal LD. 
Any good spelling achievement test will document 
the disability; however, little insight is gained from 
mere scores. What the examiner should seek to 
know is the qualitative nature of the problem, not 
just its quantitative dimensions. Individual achieve- 
ment tests are invaluable in this regard. By observ- 
ing the details of deficient performance, an astute 
examiner can form hypotheses about the origin of 
an achievement problem. For example, a child 
whose spelling is phonetically correct is at least 
hearing the words correctly, whereas a child with 
nonphonetic spelling might very well reveal a prob- 
lem with auditory processing of speech sounds. 


Individual Achievement Tests 


More than a dozen individually administered 
achievement tests exist, but only a few are widely 
used in the assessment of learning difficulties. 
Prominent individual achievement tests include the 
Diagnostic Achievement Battery-Second Edition 
(DAB-2), the Kaufman Test of Educational Achieve- 
ment (K-TEA), the Mini-Battery of Achievement 
(MBA), the Peabody Individual Achievement 
Test-Revised (PIAT-R), the Wechsler Individual 
Achievement Test (WIAT), the Woodcock-Johnson 
Psycho-Educational Battery-Revised (WJ-R), and 
the Wide Range Achievement Test-III (WRAT-III).
The essential features of these tests are outlined in 
Table 10.4. Owing to limitations of space, we will 
select one test, the K-TEA, for more detailed presentation.
Readers who seek further information
about individual achievement tests are encouraged
to consult Sattler (1988, chap. 13) and the Mental
Measurements Yearbook series.


TABLE 10.4 Survey of Widely Used Individual Achievement Tests

Diagnostic Achievement Battery-2 (DAB-2)
(Newcomer, 1990)
Suitable for ages 6 through 14, the DAB-2 consists of
12 subtests used to compute eight diagnostic composites.
The composite scores include Listening, Speaking,
Reading, Writing, Mathematics, Spoken Language,
Written Language, and Total Achievement. More
comprehensive than most achievement tests, the DAB-2
takes up to two hours to administer. The test was
carefully normed on 2,600 children nationwide.

Kaufman Test of Educational Achievement (K-TEA)
(Kaufman & Kaufman, 1985)
The K-TEA is a well-normed individual test of educational
achievement; a special feature is its detailed
error analysis (see text). Currently, norms extend only
through high school age. A separate brief form that
can be administered in 30 minutes or less is useful for
screening purposes.

Mini-Battery of Achievement (MBA)
(Woodcock, McGrew, & Werder, 1994)
Assesses four broad achievement areas (reading, writing,
mathematics, and factual knowledge) for persons ages
4 through 90+. The complete battery can be administered
in 30 minutes. The MBA provides a more extensive coverage
of basic and applied skills than any other brief battery.
For example, the reading component assesses letter-word
identification, vocabulary, and comprehension.

Peabody Individual Achievement Test-Revised (PIAT-R)
(Markwardt, 1989)
For ages 5 through 18, this 60-minute test includes subtests
of general information, reading recognition, reading
comprehension, mathematics, and spelling. A new
subtest, written expression, is now offered for screening
written language skills. Administration of the PIAT-R
requires minimal training; the test can be administered
by properly trained classroom teachers.

Wechsler Individual Achievement Test-II (WIAT-II)
(Wechsler, 2001)
The WIAT-II consists of nine subtests: oral language,
listening comprehension, written expression, spelling,
word reading, pseudoword decoding, reading
comprehension, numerical operations, and mathematics
reasoning. The test is suitable for children age 4
through adults age 89 and is empirically linked with
all of the Wechsler intelligence scales. Administration
to older persons can take up to 75 minutes. Selected
subtests can be administered for brief screening
purposes.

Woodcock-Johnson-Revised (WJ-R)
(Woodcock & Johnson, 1989)
The Woodcock-Johnson-Revised covers individuals
from 2 years of age through adulthood. The full
battery encompasses three areas of functioning:
achievement, cognitive ability, and interest. The
nine standard achievement tests yield cluster scores
in areas labeled Broad Reading, Broad Mathematics,
Broad Written Language, Broad Knowledge, and
Skills. The achievement tests are widely respected,
but some reviewers recommend caution in the use
of the cognitive battery.

Wide Range Achievement Test-III (WRAT-III)
(Wilkinson, 1993)
Well-normed for ages 5 through 75, the WRAT-III
is widely used as a screening instrument. The subtests
include reading, spelling, and arithmetic. The major
weakness of the battery is the reading subtest, which
is really only a measure of word recognition. The
reading subtest consists of asking the examinee to
pronounce aloud each word from a list ranging from
simple to difficult. Because of the limited item content
and the high intercorrelations among the subtests,
the WRAT-III is unsuited for the identification of
specific skill deficits.


Kaufman Test of Educational 
Achievement (K-TEA) 


The K-TEA is an untimed test of educational 
achievement for children ages 6 through 18. A brief, 
three-subtest version exists, but for diagnostic assessment
of learning difficulties the Comprehensive
Form is preferred. The K-TEA Comprehensive
Form consists of five subtests: Reading Decoding, 
Reading Comprehension, Mathematics Applica- 
tions, Mathematics Computation, and Spelling. 
Testing time is approximately one hour for older 
examinees, but somewhat less for younger children. 

Brief examples of K-TEA-like items are shown
in Table 10.5. These examples would be at the upper
end of the subtests, appropriate for high-school
students.


TABLE 10.5 Examples of Characteristic K-TEA
Items Applicable to Older Children

Reading Decoding
The examiner points to each word in turn and says,
“What word is this?”

duodecagon   obstreperous   correlative
indolence   perspicacity

Reading Comprehension
The examiner says, “Do what this says.”
Utter a fallacious response to the question, “How
many eyes does a cyclops have?”

Spelling
The examiner explains the rules for a traditional
spelling test concluding with, “I want you to write the
word on this sheet.”
“Paramour. One’s lover is called a paramour.”

Mathematics Computation
The examiner says, “Now I want you to work these
problems.”

(X − 7)(X − 9) =

    5 lb  5 oz
  − 2 lb 14 oz

Mathematics Applications
The examiner says, “The Missoula Muggers played 80
ballgames last year. They won 16 games. What percentage
of the games did they win?”


The K-TEA utilizes entry and exit rules
for each subtest to ensure that students encounter
only items of appropriate difficulty. Scoring is
completely objective: one for correct items, zero for
incorrect items. Raw scores are converted to standard
scores (mean of 100, SD of 15) for each subtest,
the reading composite, the mathematics
composite, and the entire battery composite.
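
The standard-score metric (mean of 100, SD of 15) can be illustrated with a simple linear conversion. The Python sketch below rests on loudly stated assumptions: in practice the conversion is read from the publisher's norm tables, whereas here the function name and the norm-group mean and standard deviation are hypothetical values chosen only to show how the metric works.

```python
def standard_score(raw, norm_mean, norm_sd):
    """Linear conversion of a raw score to the mean-100, SD-15 metric.

    norm_mean and norm_sd describe a hypothetical same-age norm group;
    real tests supply this conversion through published norm tables.
    """
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return round(100 + 15 * z)


# A raw score of 34 where the (invented) norm group averages 40 with SD 8.
print(standard_score(raw=34, norm_mean=40, norm_sd=8))  # 89
```

A raw score equal to the norm-group mean maps to 100, and a raw score one standard deviation above it maps to 115.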

In addition to formal scoring, the K-TEA pro- 
vides a systematic method for evaluating the qual- 
itative nature of subtest errors. For example, on the 
spelling subtest, errors can be classified according 
to whether they involve prefixes, suffixes, vowel 
digraphs (such as ue in blue) and diphthongs, consonant
clusters (such as scr in unscrupulous),
r-controlled patterns (such as er in inferior), and
several other patterns. 


Kaufman and Kaufman (1985) stress that the 
error analysis provides the diagnostician with a 
source of information from which instructional 
objectives can be developed. For example, a weak- 
ness in vowel digraphs and diphthongs on the 
Spelling subtest translates directly to classroom 
objectives: practice in the spelling and reading of 
these elements in isolation, progressing to spelling 
and pronouncing words containing digraphs and 
diphthongs, and ending in writing and reading 
sentences containing words with vowel digraphs 
and diphthongs. The K-TEA manual contains 
many useful clinical insights with educational 
ramifications. 

Although the normative samples of the K-TEA 
and the description of their characteristics are a 
model of excellence, the technical characteristics 
of this instrument vary in adequacy. Split-half reli- 
abilities for the five subtests range from .87 to .96, 
quite acceptable values for an achievement test. 
Stability data are less clear. Data for 172 students 
who were retested within 35 days were collapsed 
for grades 1 to 6 and grades 7 to 12. All correlations 
for subtests and composites exceeded .90, but these 
values are likely inflated because they confound 
stability of achievement with grade × achievement
correlations. 
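
For readers unfamiliar with split-half reliability, the following Python sketch shows one common way to compute it: correlate scores on the odd-numbered and even-numbered items, then apply the Spearman-Brown correction to estimate reliability at full test length. This is an illustration only. The source does not state which splitting method was used for the K-TEA, and the function names and item data are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per examinee (odd-even split)."""
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Invented responses of five examinees to a six-item subtest.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
]
print(round(split_half_reliability(responses), 2))
```

The correction matters because each half is only half as long as the full test, and shorter tests are less reliable.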

The content validity of the K-TEA appears to 
be very strong, but this point may vary from one 
school system to another. After all, individual 
school systems may choose to emphasize different 
domains of achievement. Salvia and Ysseldyke 
(1991) warn that users must be sensitive to the cor- 
respondence of K-TEA content with the students’ 
curriculum. As with any achievement test, the user 
should verify that the content of the K-TEA is ap- 
propriate within the curricular setting. Nonetheless, 
Kaufman and Kaufman (1985) offer sufficient evi- 
dence for the validity of the test to make a case for 
general adequacy. 


Test Batteries in the Assessment 
of Learning Disability 


Although experts agree that the assessment of a potential
learning disability requires a multifaceted
approach, there is little consensus as to the best
instruments and techniques. Of course, the most essential
tools in the assessment of children with LD
are reliable and valid measures of intelligence and 
achievement. Virtually every LD test battery in- 
cludes mainstream instruments in both areas, for 
example, SB:FE, WPPSI-R, or WISC-III for intel- 
lectual assessment, and PIAT-R, K-TEA, WJ-R, or 
WIAT for measurement of achievement. However, 
the choice of ancillary measures for examining language
skills, specific forms of information processing,
or visual-spatial processing will differ
from one practitioner to another. Furthermore, 
many examiners will individualize each test battery 
in light of the referral issues. In Case Exhibit 10.1, 
we provide a representative test battery for the LD 
assessment of Jimmy, a nine-year-old boy with a 
history of school failure. 


ASSESSMENT OF ADHD


Attention-Deficit/Hyperactivity Disorder (ADHD) 
is the term proposed by the American Psychiatric 
Association to designate a behavioral syndrome 
previously known as attention-deficit disorder with 
hyperactivity, minimal brain dysfunction, and hy- 
perkinesis (American Psychiatric Association, 
1994). Children with this disorder often exhibit 
academic underachievement. Therefore, they are 
frequently referred for learning disability assess- 
ment. However, even though LD and ADHD often 
coexist and their symptoms overlap slightly, they
are conceptually distinct syndromes. Here, we re- 
view certain useful instruments designed to help di- 
agnose the ADHD syndrome. 

ADHD comes in three varieties: predominantly 
inattentive, predominantly hyperactive-impulsive, 
and combined type. The inattentive type is defined 
by the presence of six or more symptoms of inat- 
tention, but fewer than six symptoms of hyperactiv- 
ity-impulsivity. The hyperactive-impulsive type is 
defined by the presence of six or more symptoms of 
hyperactivity-impulsivity, but fewer than six symp- 
toms of inattention. The combined type consists of 
six or more symptoms in both clusters. In all cases, 
the symptoms must be present for at least six 
months and lead to impairment in social, academic, 
or occupational functioning. The diagnostic symp- 
toms for ADHD are summarized in Table 10.6. 
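
The counting rule described above can be written out as a short sketch. It is offered only to make the logic of the three varieties explicit, not as a diagnostic tool. The function name, the argument names, and the upper limit of nine symptoms per cluster (taken from the nine entries listed for each cluster in Table 10.6) are assumptions made for illustration.

```python
# A minimal sketch of the symptom-count rule described in the text.
# Names and structure are illustrative assumptions, not an official algorithm.

SYMPTOM_THRESHOLD = 6      # "six or more symptoms" in a cluster
MIN_DURATION_MONTHS = 6    # symptoms present for at least six months
SYMPTOMS_PER_CLUSTER = 9   # assumed from the nine entries per cluster in Table 10.6


def classify_adhd_variety(inattentive, hyperactive_impulsive,
                          months_present, causes_impairment):
    """Return the variety implied by the counting rule, or None if unmet."""
    for count in (inattentive, hyperactive_impulsive):
        if not 0 <= count <= SYMPTOMS_PER_CLUSTER:
            raise ValueError("symptom counts must be between 0 and 9")

    # Duration and impairment are required in all cases.
    if months_present < MIN_DURATION_MONTHS or not causes_impairment:
        return None

    high_inattention = inattentive >= SYMPTOM_THRESHOLD
    high_hyperactivity = hyperactive_impulsive >= SYMPTOM_THRESHOLD

    if high_inattention and high_hyperactivity:
        return "combined"
    if high_inattention:
        return "predominantly inattentive"
    if high_hyperactivity:
        return "predominantly hyperactive-impulsive"
    return None
```

A child rated with seven inattentive and three hyperactive-impulsive symptoms, present for eight months and causing impairment, would fall in the predominantly inattentive variety under this rule. The same counts with only four months of symptoms would not qualify.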

Unfortunately, these official criteria are dispiritingly
vague, and the symptoms they describe are relatively
common even in normal children. No wonder that estimates of
hyperactivity range from 3 percent to 20 percent in
the school-age population (Cantwell, 1975). To fur- 
ther complicate matters, experts in this field em- 
phasize other elements in the ADHD picture. For 
example, Barkley (1981) notes that ADHD children 
are mainly deficient in situations in which instruc- 
tions require delayed responding and/or sustained 
responding according to tightly defined rule sys- 
tems. He also emphasizes that these children per- 
form poorly when reinforcements are delayed. 
Thus, in the typical testing environment with inter- 
esting tasks and immediate rewards, an ADHD 
child often will appear quite normal. The examiner
who suspects that a child exhibits ADHD faces a
daunting diagnostic challenge. Fortunately, several
reliable and valid rating systems can be of assistance
in making the diagnosis.


TABLE 10.6 Diagnostic Symptoms of Attention-Deficit/Hyperactivity Disorder

Inattentive Type (Six or More Symptoms)
  Lack of attention to details
  Difficulty sustaining attention
  Does not seem to listen
  Failure to follow through
  Difficulty organizing tasks
  Avoids sustained mental effort
  Loses things
  Easily distracted
  Forgetful in daily activities

Hyperactive-Impulsive Type (Six or More Symptoms)
  Fidgets and/or squirms
  Leaves seat in classroom
  Inappropriate running or climbing
  Difficulty playing quietly
  Seems driven, always on the go
  Talks excessively
  Blurts out answers
  Difficulty waiting turn
  Interrupts or intrudes on others

Combined Type
  Six or more of the symptoms in each of the above areas

Source: Based on American Psychiatric Association. (1994).
Diagnostic and statistical manual of mental disorders (4th ed.).
Washington, DC: Author.


Conners (1997) has produced three rating 
scales that are useful for identifying hyperactivity 
and other behavioral problems in children. Sepa- 
rate scales (similar in items and format) are 
available for parents, teachers, and adolescents to 
provide multiple perspectives on the attention-related
behaviors of a referred child. The scales are
suitable for children ages 3 to 17, although the child
self-report version is available only for older chil- 
dren ages 12 to 17. Each of the three scales is avail- 
able in both a short version and a long version. We 
will discuss here only the long version of the scale 
designed for parents. All three scales provide sim- 
ilar information and subscores. 

The Conners’ Parent Rating Scale-Revised 
(CPRS-R) is an 80-item scale developed to assess 
problem behaviors reported by parents. Normative 
data are based upon a sample of more than 8,000 
children and adolescents. The instrument includes 
the following subscales: 


• Oppositional
• Cognitive Problems/Inattention
• Hyperactivity
• Anxious-Shy
• Perfectionism
• Social Problems
• Psychosomatic
• DSM-IV Symptom Subscales
• ADHD Index
• Conners’ Global Index


Separate norms are provided for boys and girls in 
3-year intervals, and the standardized data are 
based on the means and standard deviations for 
groups of children with ADHD and children with- 
out psychological problems. The Global Index is a 
general index of overall adjustment that is useful in 
treatment monitoring. The 80-item CPRS-R re- 
quires 15 to 20 minutes for completion. 

On all the Conners scales, informants rate symp- 
toms on a four-point scale (0-3). The format of the 
various Conners scales is of the following nature: 


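                          Not at    Just a    Pretty    Very
                          all       little    much      much
Cries easily              ____      ____      ____      ____
Restless and fidgety      ____      ____      ____      ____
Acts without thinking     ____      ____      ____      ____
Disobeys adults           ____      ____      ____      ____
Gets into trouble         ____      ____      ____      ____
Daydreams                 ____      ____      ____      ____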


Trites, Blouin, and LaPrade (1982) conducted a 
factor analysis of an earlier version of the Conners’ scale
using a large stratified random sample of 9,583 Canadian
schoolchildren. The results yielded six factors.
The first factor, by far the most prominent,
was hyperactivity; 17 of the 39 items loaded on it,
and it accounted for 36 percent of the variance. The
hyperactivity factor also possessed excellent internal
consistency (coefficient alpha of .94). In several other 
studies of childhood behavior checklists, hyperactiv- 
ity emerges as the first and most robust factor when 
scale items are factor analyzed (Trites, 1979). 
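
For readers who wish to see how an internal consistency figure such as the coefficient alpha cited above is obtained, the following is a minimal sketch of the standard formula, alpha = (k / (k - 1)) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the small set of 0-3 ratings are invented for illustration and are not data from Trites et al. (1982).

```python
# A minimal sketch of coefficient (Cronbach's) alpha for a set of rated items.
# The ratings below are hypothetical and serve only to exercise the formula.
from statistics import pvariance


def coefficient_alpha(item_scores):
    """item_scores: one list per respondent, each holding k item ratings."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    if any(len(row) != k for row in item_scores):
        raise ValueError("every respondent needs a rating for every item")

    # Variance of each item across respondents.
    item_variances = [pvariance([row[i] for row in item_scores])
                      for i in range(k)]
    # Variance of the summed scale score across respondents.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings on a 0-3 scale: four informants, three items.
hypothetical_ratings = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [3, 2, 3],
]
alpha = coefficient_alpha(hypothetical_ratings)
```

Alpha approaches 1.0 as the items rise and fall together across respondents, which is why a value of .94 for the hyperactivity factor is described as excellent.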

Validity evidence for the Conners scales is sub- 
stantial and includes the following (Martens, 1992): 


• Scale scores show appropriate changes when hyperactive
  children are treated with drugs known
  to improve attention.

• The rating scales possess strong, positive correlations
  with other rating scales, peer ratings, and
  independent observations.

• Scale scores discriminate appropriately among
  diagnostic groups.


In general, a voluminous research base supports the 
validity of these instruments. 

Test publishers have released an almost dizzy- 
ing array of checklists for ADHD and other child- 
hood behavior problems in recent years. Most of 
these instruments are designed for use by parents 
and teachers in the context of school-based assess- 
ment. For example, Achenbach (1991, 1992) has 
published revised versions of his parent-informant 
Child Behavior Checklist (CBCL), a highly re- 
spected instrument that assesses social problems 
and academic competence in a wide variety of be- 
havioral domains. We have summarized several 
newer, more restricted instruments in Table 10.7. 


TABLE 10.7 Domains Assessed by Rating Scales for Attention Deficit and Related Disorders

                                      Scales
Domains                   ACTeRS    ADDES    ADHDT    CAAS
Inattention                 *         *        *        *
Conduct/Aggressiveness                                  *
Hyperactivity               *         *        *        *
Impulsivity                           *        *        *
Oppositional                *
Social Skills               *

ACTeRS = ADD-H: Comprehensive Teacher’s Rating Scale-2nd Edition (Ullman, Sleator, & Sprague, 1988).
ADDES = Attention Deficit Disorders Evaluation Scales (McCarney, 1989).
ADHDT = Attention-Deficit Hyperactivity Disorder Test (Gilliam, 1994).
CAAS = Children’s Attention and Adjustment Survey (Lambert, Hartsough, & Sandoval, 1990).


ASSESSMENT OF EMOTIONAL
AND BEHAVIORAL DISORDERS


While most children are carefree and enjoy going
to school, every teacher can cite cases similar to the
following:


Peter was a nine-year-old third grader whose academic
performance was erratic. His parents expressed
concern that he was anxious and withdrawn at home.
His teacher noted certain “odd” social behaviors such
as never looking other children in the eye, screaming
and making odd sounds on the playground, and appearing
too eager to please other children in games.
Peter also seemed excessively concerned about keeping
his books, papers, and pencils rigidly ordered on
his desk. He would spend many minutes each day
arranging and rearranging these materials.


Is something wrong with Peter? What is the role of 
the school psychologist in dealing with the appar- 
ent emotional problems of this child? 

In addition to assessment for learning disability, 
school psychologists also perform evaluations to 
determine whether children meet the criteria for a 
serious emotional or behavioral disturbance. Be- 
ginning in 1975 with Public Law 94-142, the U.S. 
Congress stipulated that children with serious emo- 
tional or behavioral disorders were eligible for 
special services funded indirectly by the federal 
government. Hence, for purposes of identification, 
funding, and treatment, school psychologists need 
to determine whether designated children are emo- 
tionally disturbed or behaviorally disordered. 

The federal law classifies these children as “se- 
riously emotionally disturbed” (SED). The process 
by which a student is identified as SED involves in- 
terviews with parents and teachers, evaluation with 
behavior rating scales, and direct classroom observation.
The goal is to identify children who exhibit
inappropriate behaviors, feelings, or patterns of
social interaction. We focus here upon the role of be- 
havior rating scales in this process, because they pro- 
vide a relatively objective approach to assessment. 

In a child behavior rating scale, key informants 
such as parents and teachers are asked to rate a child 
on relatively discrete behaviors such as likes to be 
alone, gets in fights, talkative, accident-prone, gets 
along with others, and the like. The ratings can be 
dichotomous (yes-no) but more commonly are along 
a three- or four-point continuum (e.g., never, occa- 
sionally, frequently, always). The items are grouped 
into factor-analytically derived scales that yield per- 
centiles or other scores in reference to standardiza- 
tion samples of reasonably normal (i.e., nonreferred) 
children. Several dozen behavior rating scales have 
been developed according to this strategy. A few of 
the most widely used instruments of this nature are 
identified in Table 10.8. We focus here upon a tool 
with extensive empirical underpinnings, the Child 
Behavior Checklist (Achenbach, 1991). 


TABLE 10.8 A Brief Listing of Representative Child Behavior Rating Scales

Personality Inventory for Children-2 (PIC-2)
(Lachar & Gruber, 2001)
Suitable for children 5 through 19 years of age, the
PIC-2 consists of 275 true-false statements that are completed
by a parent or parental surrogate. The 9 scales
and 21 subscales provide a comprehensive review
of child functioning in social skills, family relations,
cognitive skills, and psychological adjustment.

Behavior Assessment System for Children
(Reynolds & Kamphaus, 1992)
This scale is an omnibus instrument that includes parent,
teacher, and child versions; the scale yields scores in
broad externalizing and internalizing domains as well as
in specific content areas, including aggression, hyperactivity,
conduct problems, attention problems, depression,
anxiety, withdrawal, somatization, and social skills.

Home Situations Questionnaire
(Barkley & Edelbrock, 1987)
This scale consists of 16 items pertaining to home situations
in which noncompliant behavior may occur (e.g.,
child is asked to do chores); parents rate each item on a
nine-point scale.

Social Skills Rating System
(Gresham & Elliott, 1990)
Available in parent, teacher, and self-rating forms, this
is a 55-item questionnaire that provides specific information
in three domains (social skills, problem behaviors,
and academic competence).


Child Behavior Checklist


The Child Behavior Checklist (CBCL) is one of the
most carefully designed and thoroughly developed
scales in all of clinical psychology. Actually, the instrument
comes in several different forms depending
upon the age of the child and whether it is to be
filled out by parents or teachers. We restrict our discussion
here to the CBCL/4-18, suitable for parents’
reports of competencies and problems in
children ages 4 through 18 (Achenbach, 1991). The 
origin of this multidimensional tool dates back to 
1966 when Achenbach analyzed over 600 clinical 
case histories of children to identify discrete symp- 
toms that were relatively easy to observe and also 
general as opposed to excessively specific. Further 
research and consultation resulted in the 100+ 
items that comprise the behavior-problem portion 
of the CBCL. These items are rated 0 (not true as
far as you know), 1 (somewhat or sometimes true),
or 2 (very true or often true). In addition, the instrument
includes items that tap social competency
in three broad categories: activities, social, and 
school. 

Based upon numerous factor-analytic studies, 
Achenbach discovered that parents tend to portray 
children’s problems along eight dimensions: Ag- 
gressive Behavior, Anxious/Depressed, Attention 
Problems, Delinquent Behavior, Social Problems, 
Somatic Complaints, Thought Problems, and With- 
drawn. These scale patterns were derived from fac- 
tor analysis of ratings for 4,455 clinically referred 
children. Results are reported as percentiles in 
comparison to a normative sample of 2,368 nonre- 
ferred children. In addition to individual scale 
scores, the CBCL yields an Internalizing score 
(problems are internalized), an Externalizing score 
(problems are externalized), and a Total Problem 
score (reflective of overall maladjustment). 

Consider the test results for Peter, the troubled 
third grader described previously. One valuable 
feature of the CBCL is that ratings for both parents 
can be compared as a kind of reliability check. The 
percentile ranks for the ratings from his mother and 
father, respectively, are listed here, with higher 
scores indicating a greater problem: 


Scale                    Mother    Father
Aggressive Behavior        52        65
Anxious/Depressed          98        91
Attention Problems         98        98
Delinquent Behavior        34        45
Social Problems            93        91
Somatic Complaints         88        89
Thought Problems           98        77
Withdrawn                  95        95
Total Problem score        98        98


Overall the results are within the clinical range 
(Total Problem score at the 98th percentile). The in- 
dividual scales indicate that Peter is perceived as 
highly withdrawn, anxious/depressed, with pos- 
sible thought problems (e.g., odd or peculiar 
thoughts) and a distinct difficulty paying attention. 
Further assessment by the school psychologist revealed
average-range intelligence (Full Scale IQ
of 94) with a huge discrepancy between verbal ability
(VIQ of 110) and performance ability (PIQ of
79). In interview, Peter revealed loose, disordered
thinking and marked distractibility. Overall, the re- 
sults indicated that he qualified as SED and was de- 
serving of special psychological services at school. 
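
The informal reliability check mentioned above, comparing the two parents' ratings scale by scale, can be sketched as follows. The percentile ranks are those reported for Peter. The cutoff of the 90th percentile for calling a scale "elevated" and the function names are assumptions chosen only for illustration; they are not CBCL scoring rules, and a simple gap between raters is not a formal reliability coefficient.

```python
# A minimal sketch comparing mother and father percentile ranks for Peter.
# Percentiles come from the text; the cutoff below is a hypothetical choice.

PETER_CBCL = {  # scale: (mother, father)
    "Aggressive Behavior": (52, 65),
    "Anxious/Depressed": (98, 91),
    "Attention Problems": (98, 98),
    "Delinquent Behavior": (34, 45),
    "Social Problems": (93, 91),
    "Somatic Complaints": (88, 89),
    "Thought Problems": (98, 77),
    "Withdrawn": (95, 95),
}

ELEVATED_PERCENTILE = 90  # assumption for illustration, not an official cutoff


def rater_gaps(ratings):
    """Absolute difference between the two raters on each scale."""
    return {scale: abs(mother - father)
            for scale, (mother, father) in ratings.items()}


def elevated_for_both(ratings, cutoff=ELEVATED_PERCENTILE):
    """Scales on which both raters place the child at or above the cutoff."""
    return [scale for scale, (mother, father) in ratings.items()
            if mother >= cutoff and father >= cutoff]


gaps = rater_gaps(PETER_CBCL)
largest_gap_scale = max(gaps, key=gaps.get)
agreed_elevations = elevated_for_both(PETER_CBCL)
```

Under these assumptions the parents agree closely on most scales, and the one notable disagreement is on Thought Problems, where the mother's rating is much higher than the father's.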


TESTING FOR GIFTEDNESS


One ideal of Western societies is that each person 
should be educated to his or her potential. This 
ideal is not only consistent with prevailing egali- 
tarianism, it is also shrewd policy insofar as suit- 
able education of very gifted children pays huge 
dividends for society in general. It is the gifted in- 
dividuals who develop original and effective scien- 
tific concepts; discover cures for the ailments of 
humankind; produce great works of art; and invent 
new and useful products, tools, and machines. The 
early identification of gifted children is essential if 
we are to nurture their talents for the benefit of all.¹

The designation of a person as gifted typically 
means that he or she has extraordinary ability in 
some area (Horowitz, 1994). Renzulli (1986, 2002) 
notes that within this general definition, scholars 
have pursued two broad categories of giftedness. 
These might be referred to as “schoolhouse gifted- 
ness” and “creative-productive giftedness.” The 
first variety is typified by students who excel at tra- 
ditional academic pursuits such as writing, mathe- 
matics, or the sciences. These children would be 
described as possessing intellectual giftedness.
Regarding this variety of talent, Renzulli (1986) notes:


It exists in varying degrees; it can be identified 
through standardized assessment techniques; and we 
should therefore do everything in our power to make 
appropriate modifications for students who have the 
ability to cover regular curricular material at ad- 
vanced rates and levels of understanding. (p. 57) 


This kind of giftedness is easily assessed by IQ or
other ability and achievement tests. The second category
of giftedness, creative-productive giftedness,
is more difficult to evaluate. The identification of
talent within this domain rests upon more subjective
procedures; few tests are suited to this purpose.


1. As this is a book on psychological testing, there is not room
to analyze trends in funding for the education of gifted individuals.
Yet it is discouraging to note that in the 1990s less than
one-tenth of 1 percent of all the federal funds spent on elementary
and secondary education went to programs for gifted
children (Irwin, 1992).


Perhaps this is one reason why school systems 
often restrict the concept of giftedness to the intel- 
lectual realm and rely upon standardized tests for the 
identification of eligible children. The point at which 
intellectual ability is high enough to classify a child 
as mentally superior or gifted is, of course, a some- 
what arbitrary decision. A typical approach is to re- 
serve the label of giftedness (and access to enriched 
educational opportunities) for students scoring in the 
top 1 percent on standard intelligence tests such as 
the Stanford-Binet or Wechsler scales. This trans- 
lates to an IQ of about 135 and above. In summary, 
one approach to the identification of talented chil- 
dren involves teacher or parent nomination, admin- 
istration of an appropriate individual IQ test, and 
then selection of children for enriched educational 
experiences based upon a sufficiently high test score. 
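
The figure of "about 135" can be checked with a short calculation. The sketch below assumes that IQ scores are normally distributed with a mean of 100 and a standard deviation of 15, as on the Wechsler scales; the second call uses a standard deviation of 16, as on the Stanford-Binet: Fourth Edition composite. The function name is an invention for illustration.

```python
# A minimal sketch: the IQ score that marks off a given top fraction of a
# normal distribution. Assumes normality; mean and SD are parameters.
from statistics import NormalDist


def iq_cutoff(top_fraction, mean=100.0, sd=15.0):
    """Return the score at or above which `top_fraction` of scores fall."""
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be between 0 and 1")
    return NormalDist(mu=mean, sigma=sd).inv_cdf(1.0 - top_fraction)


cutoff_sd15 = iq_cutoff(0.01)           # top 1 percent, SD of 15
cutoff_sd16 = iq_cutoff(0.01, sd=16.0)  # top 1 percent, SD of 16
```

With a standard deviation of 15 the cutoff rounds to 135, in agreement with the text; with a standard deviation of 16 it is roughly two points higher, which is one reason the exact cutoff depends on the test chosen.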

An extension of the test-based approach to the
identification of schoolhouse or intellectual gifted- 
ness is the use of a quantitative rating system that 
incorporates test data, grades, and teacher recom- 
mendations (Table 10.9). This method provides a 
more stable platform for the identification of intel- 
lectual talent. 
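
The logic of such a quantitative rating system can be sketched in a few lines. Only the maximum score of 31 and the cutoff of 28 points are taken from Table 10.9. The point values assigned to each score range below are hypothetical, invented so that the maxima sum to 31, and should not be read as the weights used by Gallagher and Courtright (1986). The function names, and the treatment of the numeric ranges as percentiles, are likewise assumptions.

```python
# A minimal sketch of a weighted rating system for identifying giftedness.
# All point values are HYPOTHETICAL; only MAX_SCORE and GIFTED_CUTOFF
# (31 and 28) come from Table 10.9.

MAX_SCORE = 31
GIFTED_CUTOFF = 28

# (lowest percentile in band, hypothetical points), highest band first.
HYPOTHETICAL_IQ_BANDS = [(95, 10), (92, 8), (89, 6), (86, 4), (80, 2)]
HYPOTHETICAL_ACHIEVEMENT_BANDS = [(95, 9), (92, 7), (89, 5), (86, 3), (80, 1)]
HYPOTHETICAL_GRADE_POINTS = {"A": 6, "B": 4, "C": 2}
HYPOTHETICAL_TEACHER_POINTS = {
    "Most Promising in Class": 6,
    "Excellent Student": 4,
    "Above Average": 2,
    "Average": 1,
}


def band_points(bands, percentile):
    """Points for the highest band whose floor the percentile reaches."""
    for floor, points in bands:
        if percentile >= floor:
            return points
    return 0


def rating_total(iq_percentile, achievement_percentile, grade, teacher_rating):
    """Sum the points earned in the four categories."""
    return (band_points(HYPOTHETICAL_IQ_BANDS, iq_percentile)
            + band_points(HYPOTHETICAL_ACHIEVEMENT_BANDS, achievement_percentile)
            + HYPOTHETICAL_GRADE_POINTS.get(grade, 0)
            + HYPOTHETICAL_TEACHER_POINTS.get(teacher_rating, 0))


def meets_cutoff(total):
    return total >= GIFTED_CUTOFF
```

Because several sources of evidence contribute to the total, a student who falls just short on one measure can still qualify on the strength of the others, which is the sense in which the method is more stable than reliance on a single test score.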


The Creative-Productive 
Conception of Giftedness 


Based on suspicions that traditional tests rely too 
heavily on specific knowledge, experience, and 
content, educators have proposed an alternative ap- 
proach to the definition and identification of tal- 
ented children. Creative-productive giftedness 
refers to children (and adults) who excel in the de- 
velopment of original products and materials 
(Sternberg & Zhang, 1995). Finding these children 
mandates that we must “look below the top 3-5% 
on the normal curve of IQ scores” (Renzulli, 1986, 
p. 58). In fact, few tests are useful in locating cre- 
ative-productive giftedness. The identification of 
these children relies heavily upon the subjective 
judgment of authorities (including teachers and
psychologists), who must apply conceptual definitions
of giftedness to the specific circumstances of
individual children.


TABLE 10.9 Sample Guidelines for the Identification of Intellectual Giftedness

Category                  Score Range                 Weighted Value*
Intelligence Test Data    95th-99th percentile
                          92-94
                          89-91
                          86-88
                          80-85
Achievement Test Data     95th-99th percentile
                          92-94
                          89-91
                          86-88
                          80-85
School Grades             A or 96-99 or Superior
                          B or 91-95 or Very Good
                          C or 80-90 or Good
Teacher Rating            Most Promising in Class
                          Excellent Student
                          Above Average
                          Average

[The weighted values for the individual score ranges are illegible in the source.]

*Maximum possible score = 31, and 28 points are needed to identify a student as gifted.

Source: Based on Gallagher, J. J., & Courtright, R. D. (1986). The educational definition of giftedness and
its policy implications. In R. J. Sternberg & J. E. Davidson (Eds.), Conceptions of giftedness. Cambridge:
Cambridge University Press.


The view of giftedness offered by Renzulli 
(1978, 1986, 2002) serves to illustrate this alterna- 
tive approach. He downplays the role of IQ, defin- 
ing giftedness instead as the confluence of three 
elements: 


1. Above-average ability 
2. Evidence of creativity 
3. Evidence of task commitment 


The first element specifies above-average but not 
necessarily superior or gifted ability. According to 
this view, a gifted child is one with above-average 
general ability (say an IQ of 115 or higher) or 
recognized talent in a specific domain. Examples 
of specific domains include chemistry, ballet, 
mathematics, musical composition, sculpture, and
photography. While general ability and many specific
talents can be measured with standardized
tests, some areas such as the arts must be evaluated 
through performance-based assessment techniques. 

The second essential element is that gifted chil- 
dren reveal flashes of creativity in their activities. 
But what is creativity and how can it be assessed 
as an element of creative-productive giftedness? 
The assessment of creativity has fascinated psy- 
chologists and educators for decades—and it 
has also proved to be a vexing problem. Although 
researchers generally acknowledge that creativity 
is something different from intelligence, beyond 
this fundamental point there is little agreement 
as to the nature or assessment of creativity (Wal- 
lach, 1985). 

Over the years creativity has been defined as a 
process, a personal characteristic, and a behavioral 
product (Amabile, 1983). An example of a process
view is the idea that creative persons excel at a specific
cognitive process called divergent thinking:


Divergent thinking is defined as the kind that 
goes off in different directions. It makes pos- 
sible changes of direction in problem solving 
and also leads to a diversity of answers, where 
more than one answer may be acceptable. (Guil- 
ford, 1959) 


An illustration of a measure of divergent think- 
ing is the Consequences Test (Guilford & Hoepf- 
ner, 1971). Examples include: “What would be the 
consequences if clouds had strings hanging down 
from them?” or “What would be the consequences 
if lightbulbs cost $10 each?” or “What would be 
the consequences if the oceans rose by 10 feet?” 
The sheer number of answers given is considered 
an index of divergent thinking, which, in turn, is 
considered evidence of creativity. Tests of diver- 
gent thinking have enjoyed periodic popularity, 
but their value as measures of creativity remains 
questionable. 

Another view of creativity is that personal traits 
signify its likely presence. According to this per- 
spective, creativity flows from the temperament, 
motivation, and character of the individual. This 
would suggest that there is a creative personality. 
Harrington (1975) has captured a not altogether flat- 
tering portrait of the creative person in his Com- 
posite Creative Personality scale. This test is an 
adjective checklist: creative persons are distin- 
guished by self-rated traits including argumentative, 
assertive, hurried, insightful, rebellious, sponta- 
neous, and versatile, among others. This line of 
research indicates that creative persons are distin- 
guished by interests, attitudes, and motivations, and 
not by intellectual ability alone. Yet the link is indi- 
rect and imperfect. As a consequence, personality 
measures rarely aid in the identification of talented 
students. Recently, Sternberg (2002) has proposed 
that creative individuals are distinguished not so 
much by specific traits as by the heartfelt decision 
to be creative: 


I believe that, although creative people differ in an 
astonishing number of ways, there is, in fact, one 
key attribute that they all possess. . . . This attribute
is the decision to be creative. People who create
decide that they will forge their own path and fol- 
low it, for better or for worse. The path is a difficult 
one because people who defy convention often are 
not rewarded. (p. 376) 


This perspective suggests that creative students will 
be characterized by a stubborn dedication to their 
creative products, even when rewards for their ac- 
tivities seem to be lacking. 

The final approach to creativity uses the prod- 
uct as the distinguishing sign of this capacity. 
According to this view, creative persons produce 
things (ideas, inventions, writings, artistic outputs, 
etc.) that meet certain criteria. For example, Jackson
and Messick (1968) apply four criteria to a creative
product:


Novelty: Creative products are new, or at least 
represent a new application of the familiar. 
Appropriateness: The product must be appropri- 
ate to the context, not merely novel. 
Transcendence of constraints: A product tran- 
scends when it goes beyond the traditional. 
Coalescence of meaning: The value of creative 
products may not be apparent at first; the full sig- 
nificance may be appreciated only with time. 


These criteria can be used to identify children who 
show promise of creativity, which is one element of 
creative-productive giftedness. Subjective judg- 
ment is needed to identify these elements of a 
creative product—no tests or rating scales are avail- 
able for this purpose. 

The third essential element in creative-produc- 
tive giftedness is evidence of task commitment. In 
the dedication and passion for their pursuits, gifted 
children astonish parents and teachers alike. These 
talented children willingly spend as much time in 
quest of their giftedness as peers spend watching 
television (Renzulli, 1986). In case studies of tal- 
ented children, qualities of persistence, endurance, 
engagement, perseverance, hard work, dedication, 
and self-confidence are mentioned over and over 
(Terman & Oden, 1959). The assessment of task 
commitment requires a subjective approach, al- 
though one operational specification might be that 
the eligible candidate must have shown a passion
for his or her area of giftedness over a specific period
of time (say, for at least one year).

Recently, several scholars have expressed con- 
cern that the notion of giftedness has been defined 
from a European-centered cultural perspective such 
that many minority students remain unrecognized 
(Hamilton, 1993; Maker, 1996). The heavy reliance 
upon test scores is considered particularly prob- 
lematic insofar as existing tests are questionable 
predictors of success in nonacademic settings,
especially for ethnic, cultural, and linguistic
minority groups. As a remedy, Maker (1996) recommends
that educators use identification practices
that are process-oriented and based upon
performance (as opposed to test scores). She de- 
veloped an authentic approach for identification of 
gifted minority children, called DISCOVER, in 
which groups of elementary students solve prob- 
lems with blocks, tangrams, puzzle books, and 
toys: 


The intent is to create a problem solving situation 
similar to what might occur in a classroom appro- 
priate for gifted students from varied cultural, eth- 
nic, economic, and linguistic backgrounds. (p. 45) 


The children rotate through three or more activities 
while observers document problem-solving pro- 
cesses and skills. The observers then fill out an 82- 
item checklist for purposes of identifying gifted 
students. Because the children use concrete mate- 
rials, they can express what Gardner (1992) refers 
to as “first order” knowledge, which involves the 
creation and understanding of stories, music, draw- 
ings, constructions, and explanations. This type of 
knowledge is less dependent on academic learning 
and proficiency in the symbol systems taught in 
school and is therefore considered a more accurate 
index of the abilities of children from diverse back- 
grounds. The DISCOVER approach reminds us 
that giftedness can be viewed from many perspec- 
tives—it is not necessarily one thing. 


SUMMARY 


1. Screening for school readiness is one 
important function of assessment within school sys- 
tems. A useful screening test should yield false-neg- 
ative rates of less than 20 percent and false-positive 
rates of less than 10 percent. Unfortunately, existing 
screening instruments rarely meet these criteria. 


2. Useful instruments for preschool screening 
include the Developmental Indicators for the Assessment
of Learning-III (DIAL-III), which assesses
motor skills, cognitive concepts, and
language skills; the Denver II, which assesses development
in four areas: personal-social, fine
motor-adaptive, language, and gross motor; and
the Home Observation for the Measurement of the 
Environment (HOME). 


3. HOME is an index of the child’s environ- 
ment based upon in-home observation and inter- 
view with the primary caretaker. The inventory 
measures the quality and quantity of stimulation 
and support for cognitive, social, and emotional de- 
velopment available to the child in the home. 


4. In the short run, there is probably no better 
index of a child being at-risk for school failure than 
an individual test of intelligence such as the 
Stanford-Binet: Fourth Edition (SB:FE) or the Dif- 
ferential Ability Scales (DAS). 


5. Based on Public Law 101-476, an exten- 
sion of Public Law 94-142, the federal definition of 
learning disabilities (defined as a significant dis- 
crepancy between intelligence and achievement) 
has lost favor with experts in the LD field. A newer 
definition refers to an intraindividual weakness in 
one or more of the core areas of academic func- 
tioning (listening, speaking, reading, writing, rea- 
soning, or mathematical abilities) as the essential 
feature of LD. 


6. There is general agreement—with oc- 
casional dissenting votes—on the following fea- 
tures of learning disabilities. A learning disability 
involves an intraindividual discrepancy in cogni- 
tive functioning; an exclusion of other disabling 
conditions as the primary cause; heterogeneity,
that is, the existence of many different subtypes;
a developmental continuity from childhood into 
adulthood; and a high incidence of social and emo- 
tional consequences. 


7. Representative of individual achievement 
tests used in the assessment of LD is the Kaufman 
Test of Educational Achievement (K-TEA), an un- 
timed test for children ages 6 through 18. The K- 
TEA consists of five subtests: Reading Decoding, 
Reading Comprehension, Mathematics Applica- 
tions, Mathematics Computation, and Spelling. 


8. K-TEA subtest scores can be converted to 
grade equivalents and standard scores with mean of 
100 and SD of 15. A qualitative error analysis is 
also possible for subtests, which improves the ed- 
ucational utility of the K-TEA. Standardization, re- 
liability, and content validity appear to be good, 
although additional test-retest studies would be 
desirable. 


9. Because Attention-Deficit/Hyperactivity 
Disorder (ADHD) often coexists with learning dis- 
ability, practitioners need tools and concepts for as- 
sessment of ADHD. The DSM-IV criteria for 
ADHD stress fidgeting, distractibility, impulsivity, 
attentional deficits, poor social skills, and not con- 
sidering consequences. Others emphasize deficien- 
cies in delayed and/or sustained responding as 
diagnostic symptoms.


10. Conners has developed a family of rating
scales for identifying hyperactivity and other be- 
havioral problems in children. Teachers, parents, 
and caretakers rate symptoms on a four-point scale. 
These instruments and other rating scales such as 
Achenbach’s Child Behavior Checklist are useful 
adjuncts in the assessment of problematic behavior 
in children. 


11. A child with serious emotional disturbance 
(SED) exhibits inappropriate behaviors, feelings, 
or patterns of social interaction. Objective scales 
such as the Child Behavior Checklist (CBCL) are 
helpful in the assessment of SED. The items on the 
CBCL are rated 0 (not true), 1 (somewhat or some- 
times true), or 2 (very true or often true) by one or 
both parents. 


12. Another function of the school psycholo- 
gist is testing for giftedness, which refers to extra- 
ordinary ability in some area. Eligible children are 
often identified by a high IQ on an individual in- 
telligence test, but other approaches can be used. 
Some experts refer to above-average intelligence, 
creativity, and task commitment as the important 
ingredients of giftedness. This approach relies 
heavily upon the subjective judgment of authorities 
(administrators, psychologists) for the identifica- 
tion of gifted children. 


KEY TERMS AND CONCEPTS 


screening p. 352
sensitivity p. 353
specificity p. 353
learning disability p. 359
Attention-Deficit/Hyperactivity Disorder p. 369
gifted p. 373
creativity p. 374
divergent thinking p. 375


Topic 10B Forensic Applications of Assessment 


Standards for the Expert Witness
Evaluation of Suspected Malingering
Assessment of Mental State for the Insanity Plea
Competency to Stand Trial
Prediction of Violence and Assessment of Risk
Evaluation of Child Custody in Divorce
Personal Injury and Related Testimony
Interpretation of Polygraph Records
Controversy over the Psychologist as Expert Witness
Summary
Key Terms and Concepts


Psychology and the legal system have had a
long and uneasy alliance characterized by
mistrust on both sides. Within the legal system,
lawyers and judges maintain antipathy toward the 
testimony of psychologists because of a concern 
that their opinions are based upon “junk science” 
(or perhaps no science at all) and also because of a 
belief (not entirely unfounded) that some expert 
witnesses will profess almost any viewpoint that 
serves the interests of a defendant. Within the men- 
tal health profession, psychologists find the adver- 
sarial aspect of courtroom testimony—based upon 
the expectation of yes-no opinions expressed as vir- 
tual certainties—to be an impossible arena in which 
to pursue the truth about human behavior. As the 
reader will discover, this essential tension between 
law and psychology is a constant backdrop that 
shapes and informs the nature of psychological 
practice in the courtroom. 

For better or for worse, psychologists do testify 
in court cases, and the focus of their testimony 
often pertains to the interpretation of psychological 
tests and assessment interviews. When are test re- 
sults and psychological opinions based upon them 
admissible in court? What criteria do judges use in 
determining whether to admit psychological testimony?
Psychologists who represent themselves as
experts and who use tests to justify their opinions 
must have a firm grounding in legal issues that per- 
tain to assessment. In this topic we examine the rel- 
evance of legal standards to testimony based upon 
psychological tests and evaluations. We also ex- 
plore a few specialized instruments useful in foren- 
sic assessment. 

The role of the psychological examiner can 
intersect with the legal system in a multitude of 
ways. The practitioner might be called upon for the 
following: 


• Evaluation of possible malingering
• Assessment of mental state for the insanity plea
• Determination of competency to stand trial
• Prediction of violence and assessment of risk
• Evaluation of child custody in divorce
• Assessment of personal injury
• Interpretation of polygraph data


These are the primary applications of forensic prac- 
tice, which we examine here. A variety of addi- 
tional applications are surveyed in Melton, Petrila, 
Poythress, and Slobogin (1998). 

In addition to meeting the general guidelines 
for ethical practice required of any clinician, practitioners
who offer expert testimony based upon
psychological tests will encounter additional stan- 
dards of practice unique to the U.S. jurisprudence 
system. We summarize major concerns regarding 
psychological tests and courtroom testimony here. 
The reader can find extended discussions of this 
topic in Melton et al. (1998) and Wrightsman, 
Nietzel, Fortune, and Greene (2002). 

Each of the previously listed topics raises 
unique questions about the role of the psychologist 
in the courtroom. However, one issue is common to 
all forms of courtroom testimony: When is a psy- 
chologist an expert witness? We discuss this gen- 
eral issue before returning to specific applications 
of psychological evaluation that intersect with the 
U.S. legal system. 

STANDARDS FOR THE EXPERT WITNESS

Just as psychologists are concerned with issues 
of standards and competence, so too are lawyers 
and judges. U.S. jurisprudence has developed 
various guidelines for courtroom testimony, in- 
cluding several general principles regarding the tes- 
timony of an expert witness. These standards are 
found in Federal Rules of Evidence (1975) and 
have been upheld by various court decisions. We 
can summarize the principles of expert testimony 
as follows:

• The witness must be a qualified expert. Not all
  psychologists who are asked to testify will be
  allowed to do so. Based on a summary of the
  expert’s education, training, and experience, the
  judge decides whether the testimony of the
  witness is to be admitted.

• The testimony must be about a proper subject
  matter. In particular, the expert must present
  information beyond the knowledge and experience
  of the average juror.

• The value of the evidence in determining guilt
  or innocence must outweigh its prejudicial effect.
  For example, if the expert’s testimony might
  confuse the issue at hand or might prejudice the
  members of the jury, it is generally not admissible.

• The expert’s testimony should be in accordance
  with a generally accepted explanatory theory. In
  most courts, guidance on this matter is provided
  by Frye v. United States, a 1923 court case
  pertaining to the admissibility of expert testimony.

In Frye v. United States, the counsel for a mur- 
der defendant attempted to introduce the results of 
a systolic blood pressure deception test. The lawyer 
offered an expert witness to testify to the result of 
the deception test. It was asserted that emotionally 
induced activation of the sympathetic nervous sys- 
tem causes systolic blood pressure to rise gradually 
if the examinee attempts to deceive the examiner. 
In other words, the expert witness asserted that in 
the course of an interrogation about a crime, the pat- 
tern of change in systolic blood pressure could be 
used as a form of lie detector test. The defense coun- 
sel wanted their expert witness to testify in support 
of the client’s innocence. Counsel for the prosecu- 
tion objected, and the Court of Appeals of the Dis- 
trict of Columbia upheld the objection, ruling: 


While courts will go a long way in admitting expert 
testimony deduced from a well-recognized scien- 
tific principle or discovery, the thing from which 
the deduction is made must be sufficiently estab- 
lished to have gained general acceptance in the 
particular field in which it belongs. (cited in Blau, 
1984) 


The court concluded that the systolic blood pres- 
sure deception test had not gained acceptance 
among physiological and psychological authorities 
and therefore refused to allow the testimony of the 
expert witness. 

According to these guidelines, a test, inventory, 
or assessment technique must have been available 
for a fairly long period of time in order to have a 
history of general acceptance. For this reason, the 
prudent expert witness will choose well-estab- 
lished, extensively researched instruments as the 
basis for testimony, rather than relying upon re- 
cently developed tests that might not stand up to 
cross-examination under the constraints of Frye v. 
United States. 

In the mid- to late 1990s, the standards for expert
testimony were refined further, beginning with
a Supreme Court decision in Daubert v. Merrell
Dow Pharmaceuticals (1993). The Court’s written 
opinion added extensive guidelines about factors to 
be considered in weighing scientific testimony in 
trials. Two additional court cases (General Electric 
Co. v. Joiner, 1997; Kumho Tire Co., Ltd. v. 
Carmichael, 1999) further extended the parameters 
of expert testimony defined by Daubert. Some- 
times known as the Daubert trilogy, these three 
cases generated several new guidelines that trial 
judges may use in determining the admissibility of 
expert testimony (Grove & Barden, 1999): 


1. Is the proposed theory (or technique), on which 
the testimony is to be based, testable? 

2. Has the proposed theory (or technique) been 
tested using valid and reliable procedures and 
with positive results? 

3. Has the theory (or technique) been subjected to 
peer review? 

4. What is the known or potential error rate of the 
scientific theory or technique? 

5. What standards, controlling the technique’s op- 
eration, maximize its validity? 

6. Has the theory (or technique) been generally 
accepted as valid by a relevant scientific com- 
munity? 

7. Do the expert’s conclusions reasonably follow 
from applying the theory (or technique) to this 
case? 


The ramifications of the Daubert trilogy rulings for 
the expert testimony of psychologists are unclear at 
this time. For example, it is uncertain whether tes- 
timony based upon the Rorschach Inkblot Test (dis- 
cussed later in this text) would be admissible under 
these newer, more restrictive guidelines (Grove, 
Barden, Garb, & Lilienfeld, 2002; Ritzler, Erard, & 
Pettigrew, 2002). What is clear at this point is that 
judges generally have tightened the standards for 
admitting expert evidence in U.S. courts (Dixon & 
Gill, 2002). For example, some courts have used 
the Daubert ruling as a basis for denying testimony 
from mental health professionals, including psychologists.
In some courts, testimony about psychological
evaluations of sexually abused children
has been ruled inadmissible. Increasingly, courts
will demand that testimony from psychologists has
a strict scientific basis (Melton et al., 1998).

EVALUATION OF SUSPECTED MALINGERING

In most settings, a psychologist safely can assume 
that clients will be reasonably honest about their 
mental and emotional state. Clients want to tell 
their stories and they want to get things right. At 
worst, they may overstate symptoms slightly so as 
to impress the clinician that help truly is deserved 
and needed. Yet outright deception and manipula- 
tion are uncommon—for the simple reason that 
clients rarely have incentive for these strategies. 

However, the rules of clinical engagement are 
turned upside-down in forensic settings. The typi- 
cal forensic client has much to gain from a case for- 
mulation that emphasizes illness and disability. 
Indeed, the context of the assessment almost guar- 
antees that clients will seek to look “crazy” or dis- 
abled, whether by exaggeration or (more rarely) 
deceptive design. In the mind of the forensic client, 
fabrication of symptoms may serve to excuse un- 
acceptable behavior (e.g., favoring the insanity 
plea), sway sentencing recommendations (e.g., 
against capital punishment), or gain entitlements 
(e.g., certification for disability). These client ma- 
neuvers clearly influence the validity of forensic as- 
sessments. Hovering in the background of every 
forensic assessment is this troubling question: Was 
the client reasonably honest and forthright? 

The forensic examiner must make a judgment 
about the honesty of the client’s self-portrayal dur- 
ing the evaluation. And yet while common sense 
dictates that the examiner should expect some de- 
gree of deception, the conclusion that a client has 
consciously malingered needs to be reached with 
caution:

Given the significant potential for deception and the 
implications for the validity of their findings, mental 
health professionals should develop a low threshold 
for suspecting deceptive responding. At the same 
time, because the label of “malingerer” may carry 
considerable weight with legal decisionmakers and 
potentially tarnish all aspects of the person’s legal
position, conclusions that a person is feigning
should not be reached hastily. (Melton et al., 1998) 


The most common and venerable method for 
identifying dishonest clients is the clinical inter- 
view. However, a more objective approach should 
be preferred. The assessment of potential malin- 
gering with interview hinges upon the judgment of 
the clinician (e.g., “This client is inconsistent in his 
presentation of symptoms and appears eager to be 
sick, so I conclude that he is malingering”), which 
may prove erroneous. In contrast, an objective ap- 
proach provides normative data, hit rates, and the 
like for the evaluation. Not only might this improve
the accuracy of the assessment, but a more
standardized approach should also find greater
acceptability in many court systems.

Unfortunately, there are relatively few objective 
approaches for the assessment of malingering in 
forensic clients. One promising instrument is the 
Structured Interview of Reported Symptoms (SIRS), 
a 172-item interview schedule designed expressly 
for the evaluation of malingering (Rogers, Bagby, 
& Dickens, 1992). The approach embodied in the 
SIRS was based upon strategies identified in the 
clinical literature as potentially useful for detecting 
malingering. Using a structured interview method, 
malingering is assessed on eight primary scales: 


• Rare Symptoms (overreporting of infrequent
  symptoms)
• Symptom Combinations (real psychiatric symptoms
  that rarely occur together)
• Improbable or Absurd Symptoms (symptoms reveal
  a fantastic quality)
• Blatant Symptoms (overendorsement of obvious
  signs of mental disorder)
• Subtle Symptoms (overendorsement of everyday
  problems)
• Severity of Symptoms (symptoms portrayed with
  extreme, unbearable severity)
• Selectivity of Symptoms (indiscriminate
  endorsement of psychiatric problems)
• Reported versus Observed Symptoms (comparison
  of observed and reported symptoms)


In addition to the eight primary scales, five supplementary
scales are used to interpret response styles.
Of the 172 questions, 32 are repeated inquiries to
detect inconsistency of responding. Examples of the 
kinds of structured interview questions include: “Do 
you ever feel like the fillings in your teeth can pick 
up radio messages?” (Rare Symptoms); “Do you 
have severe headaches at the same time as you have 
a fear of germs?” (Symptom Combinations); “Does 
the furniture where you live seem to get bigger or 
smaller from day to day?” (Improbable or Absurd 
Symptoms); and “Do you have any serious problems
with thoughts about suicide?” (Blatant Symptoms). 
The scale takes less than an hour to administer. 

Results allow for classification of examinees as 
definite feigning, probable feigning, and honest. 
Reliability of the instrument is good, with internal- 
consistency reliability coefficients for subscales 
ranging from .66 to .92. Interrater reliability
estimates are superb, ranging from .89 to 1.00.

Although the validity of the SIRS can be dis- 
cussed along the familiar lines of content, criterion- 
related, and construct validity (and the test performs 
well in these domains), the real measure of its clini- 
cal utility pertains to the capacity of the test to dis- 
criminate known or suspected malingerers from 
psychiatric patients and normal controls. One recent 
study indicates that the test performs well in this 
capacity (Gothard, Viglione, Meloy, & Sherman, 
1996). In a mixed sample of 125 males referred for 
competency evaluation (including 30 persons asked 
to simulate malingering, 7 individuals strongly sus- 
pected of malingering, and 88 persons for whom 
malingering appeared unlikely), the SIRS was over- 
all 97.8 percent accurate in classifying participants 
as malingered or nonmalingered. Recently, Heinze 
and Purisch (2001) have shown that the SIRS and 
other similar tests provide more accurate detection 
of malingering when used collectively rather than 
individually. 
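
The overall accuracy figure cited above is a hit rate: the proportion of examinees whose test classification matches their known status. The following is a minimal sketch of how a hit rate, sensitivity, and specificity might be computed from a two-by-two classification table. The class name and all counts are hypothetical and serve only as an illustration; they are not data from Gothard et al. (1996) or any other study.

```python
from dataclasses import dataclass


@dataclass
class ClassificationTable:
    """Two-by-two table comparing test classification with known status."""
    true_pos: int   # malingerers classified as malingering
    false_neg: int  # malingerers classified as honest
    false_pos: int  # honest responders classified as malingering
    true_neg: int   # honest responders classified as honest

    def hit_rate(self) -> float:
        # Proportion of all examinees who were classified correctly.
        total = (self.true_pos + self.false_neg
                 + self.false_pos + self.true_neg)
        return (self.true_pos + self.true_neg) / total

    def sensitivity(self) -> float:
        # Proportion of actual malingerers the test detects.
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Proportion of honest responders the test correctly clears.
        return self.true_neg / (self.true_neg + self.false_pos)


# Hypothetical counts, for illustration only.
table = ClassificationTable(true_pos=18, false_neg=2, false_pos=4, true_neg=76)

print(f"hit rate:    {table.hit_rate():.2f}")
print(f"sensitivity: {table.sensitivity():.2f}")
print(f"specificity: {table.specificity():.2f}")
```

Because malingerers are typically a minority of any sample, a high hit rate can coexist with modest sensitivity, which is one reason to examine all three indices rather than overall accuracy alone.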

A few other studies reveal promising results, 
but these involve predominantly white and edu- 
cated samples (Rogers, Gillis, Dickens, & Bagby, 
1991; Rogers, Kropp, Bagby, & Dickens, 1992). In 
contrast, the population of the criminal justice sys- 
tem—the arena in which SIRS most likely would 
be used—is relatively uneducated, and minorities 
are heavily overrepresented. By one estimate, more
than 80 percent of the urban jail population in the
United States is African American (Dixon, 1995). 
Unanswered is how well the SIRS would function 
in the detection of malingering within this sizable 
subpopulation. 


ASSESSMENT OF MENTAL STATE 
FOR THE INSANITY PLEA 


In criminal trials the defendant may invoke a vari- 
ety of defenses including entrapment, diminished 
capacity (e.g., from mental subnormality), automa- 
tism (e.g., from hypnotic suggestion), and the in- 
sanity plea. Whenever a special defense is invoked, 
an evaluation of the defendant’s mental state at 
the time of the offense (MSO) is required. In some 
courts, a psychologist is qualified to offer opinions 
about the MSO of a defendant. We restrict the dis- 
cussion here to the insanity plea since this is the 
most common doctrine that would trigger the need 
for an MSO evaluation. 

Almost everyone is familiar with the insanity de- 
fense, but only the exceptional person understands 
its provisions. Technically, the insanity defense is 
known as not guilty by reason of insanity (NGRI). 
Based on a few sensational and widely publicized 
trials such as the case of John Hinckley, who attempted
to assassinate President Ronald Reagan, the
lay public generally has concluded that the insanity 
defense is commonly employed by cynical lawyers 
to help dangerous clients evade legal responsibility 
for heinous crimes. Nothing could be further from 
the truth. In reality, the NGRI plea is widely re- 
spected by jurisprudence experts and is invoked in 
fewer than 1 in 1,000 trials (Blau, 1984). And in this 
tiny fraction of all criminal cases, the defense suc- 
ceeds less than 1 time in 4 (Melton et al., 1998). The 
widespread belief that persons found NGRI “walk” 
away from their crimes also is inaccurate: Most receive
hospital treatment that lasts several years. Recidivism
rates are perhaps lower (and certainly not
higher) than those of felons convicted of similar offenses
(Melton et al., 1998). Even though outlawed in some 
states, the insanity defense has shown remarkable 
resiliency—probably because it performs a desirable 
role in a modern and compassionate society. 


Several legal tests for insanity have had signif- 
icant influence in the United States, including the 
M’Naughten rule, the Durham rule, the Model 
Penal Code rule, and the Guilty But Mentally Ill 
(GBMI) verdict (Wrightsman et al., 2002). Some 
jurisdictions include irresistible impulse as a supplement
to the M’Naughten rule. A few states have
abolished the insanity defense altogether. We will 
survey the different standards briefly before com- 
menting upon the role of psychological tests in de- 
termining legal insanity. 

The M’Naughten rule is the oldest, stemming 
from an 1843 case in England. Daniel M’Naughten
was plagued by paranoid delusions that the prime
minister, Robert Peel, was part of a conspiracy
against him. M’Naughten stalked the prime minister
and, in a case of mistaken identity, shot his male
secretary at No. 10 Downing Street. M’Naughten
was found not guilty by reason of insanity, a verdict
that touched off a national furor. In response to
the furor, Queen Victoria commanded all 15 high
judges of England to appear before the House of
Lords and clarify the newly forged guidelines on
insanity. The M’Naughten rule states:


The jury ought to be told in all cases that every 
man is to be presumed to be sane, and to possess a 
sufficient degree of reason to be responsible for his 
crimes, until the contrary be proved to their satis- 
faction; and that to establish a defense on the 
grounds of insanity it must be clearly proved that, 
at the time of committing the act, the accused was 
laboring under such a defect of reason, from dis- 
ease of the mind, as not to know the nature and 
quality of the act he was doing, or, if he did know 
it, that he did not know what he was doing was 
wrong. (cited in Wrightsman et al., 1994) 


Thus, the M’Naughten rule “excuses” criminal be- 
havior if the defendant, as a consequence of a “dis- 
ease of the mind,” did not know what he or she was 
doing (e.g., a paranoid schizophrenic who believed 
he or she was shooting the literal devil) or did not 
know that what he or she was doing was wrong 
(e.g., a person with mental retardation who be- 
lieved that it was acceptable to shoot an obnoxious 
panhandler). Approximately half of the states use 
the M’Naughten rule.

Some jurisdictions also allow “irresistible impulse”
as a supplement to the M’Naughten rule. An
irresistible impulse is generally defined as a be- 
havioral response that is so strong that the accused 
could not resist it by will or reason. But when is an 
impulse irresistible as opposed to simply unre- 
sisted? This has proved difficult to define. For ob- 
vious reasons, legal experts are unhappy with the 
notion of irresistible impulse, and its use as part of 
an insanity plea appears to be waning. 

The Durham rule was formulated in 1954 by 
the District of Columbia Federal Court of Appeals 
in Durham v. United States. Dissatisfied with the
M’Naughten rule, Judge David Bazelon proposed 
a new test, known as the Durham rule, which pro- 
vided for the defense of insanity if the criminal act 
was a “product” of mental disease or defect. The 
purpose of the Durham rule was to give mental 
health professionals a wider latitude in presenting 
information pertinent to the defendant’s responsi- 
bility. Legal scholars hailed Durham as a great step 
forward, but in 1972 the rule was dropped by the 
circuit that had formulated it. 

The Durham rule was replaced by the Model 
Penal Code rule proposed by the American Law In- 
stitute. Adopted in 1972, the Model Penal Code 
rule is as follows: 


A person is not responsible for criminal conduct if 
at the time of such conduct, as a result of mental 
disease or defect, he lacks substantial capacity 
either to appreciate the criminality (wrongfulness) 
of his conduct or to conform his conduct to the re- 
quirements of the law. (cited in Melton et al., 1998) 


The Model Penal Code rule also contains provisions 
which prohibit the inclusion of the psychopath or 
antisocial personality within the insanity defense. 
The Model Penal Code rule differs from the 
M’Naughten rule in three important ways: 


1. By using the term appreciate, it acknowledges 
the emotional determinants of criminal action. 

2. It does not require a total lack of appreciation by 
offenders for the nature of their conduct—only 
a lack of “substantial capacity.” 

3. It includes both a cognitive element and a volitional
element, making defendants’ inability to
control their actions an independent criterion for
insanity (Wrightsman et al., 2002).


About 20 states now follow the Model Penal Code 
rule or slight variants of it. 

A recent development in the insanity plea is the 
Guilty But Mentally Ill (GBMI) verdict. Approx- 
imately one-fourth of the states allow juries to 
reach a verdict of GBMI in cases in which the de- 
fendant pleads insanity. Typically, in states that 
allow the GBMI verdict, the judge instructs the jury 
to return with one of four verdicts: 


• Guilty of the crime
• Not guilty of the crime
• Not guilty by reason of insanity
• Guilty but mentally ill


The intention of the last alternative is that a defen- 
dant found GBMI should receive the same sentence 
as if found guilty of the crime, but he or she begins 
the sentence in a psychiatric hospital. After treat- 
ment is completed, the defendant then serves the 
remainder of the sentence in a prison. 

But the intention of GBMI and its reality are 
two different things. Initial support for the GBMI 
verdict as a humane variant of the insanity plea has 
waned in recent years. Wrightsman et al. (2002) 
point out that jurors express confusion when asked 
to make the difficult distinction between mental illness
that results in insanity (NGRI) and mental illness
that does not (GBMI). Melton et al. (1998) find little
virtue in the verdict: 


The GBMI verdict is conceptually flawed, has sig- 
nificant potential for misleading the factfinder, and 
does not appear to achieve its goals of reducing 
insanity acquittals or prolonging confinement of 
offenders who are mentally ill and dangerous. The 
one goal it may achieve is relieving the anxiety
of jurors and judges who otherwise would have
difficulty deciding between a guilty verdict and
a verdict of not guilty by reason of insanity. It is
doubtful this goal is a proper one or worth the 
price. (p. 215) 


Empirical studies indicate that offenders found 
GBMI seldom receive adequate treatment. Fur- 
thermore, they may receive harsher sentences than 
their counterparts found merely guilty (Callahan,
McGreevy, Cirincione, & Steadman, 1992). In fact,
some defendants found GBMI have been sentenced 
to death! 

Now that the reader has been introduced to vari- 
ants of the insanity plea, we review the role of the 
psychologist in determining legal insanity. An im- 
portant point is that psychologists are rightfully 
cautious in offering an interview-based opinion as 
to a person’s mental state at the time of a crimi- 
nal offense. After all, the crime usually occurred 
days, weeks, months, or even years before, and the 
client may be unable to assist in the accurate re- 
construction of events and mental states. Conse- 
quently, psychological testimony regarding legal 
insanity should be cautious and conservative. Reli- 
ability studies of insanity evaluations also suggest 
that caution is appropriate. In a review of seven 
studies, Melton et al. (1998) determined that inter- 
rater agreement (as to whether a defendant was 
legally insane) ranged from a low of 64 percent (be- 
tween prosecution and defense psychiatrists) to a 
high of 97 percent (between psychologists with 
forensic training who used structured instruments, 
discussed later). 
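
Agreement figures of this kind are the proportion of cases in which two evaluators reach the same conclusion. The sketch below shows that calculation on hypothetical ratings (not data from Melton et al., 1998). It also computes Cohen's kappa, a chance-corrected agreement index that the studies summarized above are not stated to have used; it is included only as an assumption-labelled illustration of why raw agreement can flatter a pair of raters.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which two raters give the same verdict."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the level expected by chance alone."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    # Undefined when expected == 1 (both raters use a single category).
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two evaluators on the same ten defendants.
evaluator_1 = ["sane"] * 7 + ["insane"] * 3
evaluator_2 = ["sane"] * 6 + ["insane"] * 4

print(f"percent agreement: {percent_agreement(evaluator_1, evaluator_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(evaluator_1, evaluator_2):.2f}")
```

In this made-up example the evaluators agree on 9 of 10 cases, yet kappa is noticeably lower because most defendants are rated sane by both, so much of the agreement would be expected by chance.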

In spite of controversy over the role of the psy- 
chologist in MSO determinations, some experts 
foresee an increased role for psychological assess- 
ment in cases involving the insanity plea. In partic- 
ular, neuropsychological assessments may provide 
objective, valid data to help the courts decide the 
merits of an insanity defense. Recent court rulings 
affirm that neuropsychological test findings can be 
used to show that a defendant has impaired capa- 
bility to choose right and refrain from wrong (Blau, 
1984; Heilbrun, 1992). Martell (1992) has dis- 
cussed the relevance of neuropsychological assess- 
ment to the insanity plea as defined by the Model 
Penal Code. The Model Penal Code defines a de- 
fendant as not guilty by reason of insanity if he or 
she “lacks substantial capacity” to appreciate the 
criminality of his or her conduct. Neuropsycholog- 
ical test results have a direct bearing upon this 
issue. 

Rating scales such as the Rogers Criminal Re- 
sponsibility Assessment Scales (R-CRAS) also 
provide a useful basis for evaluating criminal responsibility
(Rogers, 1984, 1986). The R-CRAS is
completed by the examiner immediately following 
a review of clinical records, police investigative 
reports, and the final clinical interview with the 
patient-defendant. The instrument consists of clear 
descriptive criteria for 25 items assessing both psy- 
chological and situational factors. The items are 
scored with respect to the time of the crime on five 
scales measuring these variables: 


• Patient Reliability
• Organicity
• Psychopathology
• Cognitive Control
• Behavioral Control


The individual items on the R-CRAS were de- 
rived from the Model Penal Code standard of in- 
sanity (Table 10.10). Interrater reliabilities of the 
R-CRAS scales ranged from .48 (for a Malingering 
subscale) to 1.00 (for Organicity). Construct valid- 
ity was established by comparing the disposition of 
93 legal cases with R-CRAS data. Even though 
legal outcome is determined by many variables be- 
sides the psychological state of the person at the 
time of the crime, there was 95 percent agreement 
in the determination of sanity and 73 percent agree- 
ment in the determination of insanity. 

Even though reviewers recognize the promise 
of the R-CRAS, for some a healthy skepticism 
still prevails. One concern is that the subscales 
of the instrument represent an ordinal level of 
measurement, whereas an interval level of quantification
is implied. Another concern is that the
test developers’ claim to “quantify areas of judgment
that are logical and/or intuitive in nature”
leads to a false sense of scientific certainty
(Melton et al., 1998). Certainly, the R-CRAS per- 
forms a valuable function by helping clinicians 
organize their thinking and evaluation. The utility 
of the overall decision—sane versus insane—will 
rest upon additional validational research (Howell 
& Richards, 1989). In support of test validity, 
Rogers and Sewell (1999) reanalyzed 413 insanity 
cases and found that the R-CRAS contributed 
substantially to the determination of criminal 
responsibility.

TABLE 10.10 Sample Items from the R-CRAS





10. Amnesia about the alleged crime. 
(This refers to the examiner’s assessment of amne- 
sia, not necessarily the patient’s reported amnesia) 

(0) No information. 

(1) None. Remembers the entire event in consider- 
able detail. 

(2) Slight; of doubtful significance. The patient for- 
gets a few minor details. 

(3) Mild. Patient remembers the substance of what 
happened but is forgetful of many minor details. 

(4) Moderate. The patient has forgotten a major por- 
tion of the alleged crime but remembers enough 
details to believe it happened. 

(5) Severe. The patient is amnesic to most of the al- 
leged crime but remembers enough details to be- 
lieve it happened. 

(6) Extreme. Patient is completely amnesic to the 
whole alleged crime. 


11. Delusions at the time of the alleged crime. 

(1) No information. 

(2) Suspected delusions (e.g., supported only by 
questionable self-report). 

(3) Definite delusions but not actually associated 
with the commission of the alleged crime. 

(4) Definite delusions which contributed to, but 
were not the predominant force in, the commis- 
sion of the alleged crime. 

(5) Definite controlling delusions on the basis of 
which the alleged crime was committed. 





Source: Adapted and reproduced by special permission of the 
Publisher, Psychological Assessment Resources, Inc., Odessa, FL 
33556, from the Rogers’ Criminal Responsibility Assessment 
Scales by Richard Rogers, Ph.D. Copyright 1984 by PAR, Inc. 
Further reproduction is prohibited without permission from PAR, 
Inc. 


COMPETENCY TO STAND TRIAL


The Sixth Amendment to the U.S. Constitution, 
passed in 1791, guarantees every accused citizen 
the right to an impartial, speedy, and public trial 
with benefit of counsel. If the defendant is unable 
to exercise these constitutional rights for any rea- 
son, then a proper trial cannot take place. Specifi- 
cally, if the defendant has a mental defect, illness, 
or condition that renders him or her unable to understand
the proceedings or to assist in his or her
defense, the defendant would be considered in- 
competent to stand trial. This standard was con- 
firmed by the U.S. Supreme Court in Dusky v. 
United States (1960) as “whether [the defendant] 
has sufficient present ability to consult with his 
lawyer with a reasonable degree of rational 
understanding—and whether he has a rational as 
well as factual understanding of the proceedings 
against him.” In practice, competency to stand 
trial refers to four elements and distinctions (Mel- 
ton et al., 1998): 


1. The defendant’s capacity to understand the 
criminal process, including the role of the par- 
ticipants in that process 

2. The defendant’s ability to function in that 
process, primarily through consulting with 
counsel in the preparation of a defense 

3. The defendant’s capacity, as opposed to will- 
ingness, to relate to counsel and understand the 
proceedings 

4. The defendant’s reasonable degree of under- 
standing, as opposed to perfect or complete un- 
derstanding 


Most U.S. courts follow this standard, which em- 
phasizes current functioning of the accused. 

The presiding judge may request a psycholog- 
ical or psychiatric evaluation to assist in determin- 
ing a defendant’s competency to stand trial. One 
recent report indicates that more than 25,000 eval- 
uations of competency to stand trial are performed 
in the United States each year (McDonald, Nuss- 
baum, & Bagby, 1992). It is important to emphasize 
that psychologists, psychiatrists, and other mental 
health professionals merely assist in a competency 
hearing by presenting expert opinions. Only the 
judge has the power to make a competency deter- 
mination. Although there is no standard format for 
a competency determination, most judges request 
that the psychologist consider most or all of the 11 
factors cited in Table 10.11. 

Incompetency to stand trial is entirely separate 
from legal insanity; these two issues are judged by 
completely different standards. Legal insanity per- 
tains to the moment of the criminal act, whereas 
TABLE 10.11. Factors Considered in Determining
Competency to Stand Trial 





1. Defendant’s appreciation of the charges 
2. Defendant’s appreciation of the nature and range 
of penalties 
3. Defendant’s understanding of the adversary nature 
of the legal process 
4. Defendant’s capacity to disclose to attorney perti- 
nent facts surrounding the alleged offense 
5. Defendant’s ability to relate to attorney 
6. Defendant’s ability to assist attorney in planning 
defense 
7. Defendant’s capacity to realistically challenge 
prosecution witnesses 
8. Defendant’s ability to manifest appropriate court- 
room behavior 
9. Defendant’s capacity to testify relevantly 
10. Defendant’s motivation to help himself in the legal 
process 
11. Defendant’s capacity to cope with the stress of 
incarceration prior to trial 





Source: Florida Rules of Criminal Procedure, cited in Wrights- 
man, L. S., Nietzel, M. T., & Fortune, W. H. (1994). Psychology 
and the legal system (3rd ed.). Pacific Grove, CA: Brooks/Cole. 


incompetency implies a current, ongoing condition. 
Furthermore, incompetency is not synonymous with 
mental illness, although the two may occur together. 
In the event that the judge rules the defendant in- 
competent, the trial is postponed, usually for a pe- 
riod of six months or so. In some cases, persons 
found incompetent are placed in a mental institution 
for treatment to restore their competency so that a 
trial can be held later. Individuals charged with less- 
serious crimes may receive outpatient treatment. 
In addition to information obtained from the 
clinical interview, psychological test results are im- 
portant components of a competency evaluation. 
For example, a low IQ may constitute evidence of 
incompetence in the eyes of the court. Although 
there are no firm guidelines, most courts rule that 
persons with significant intellectual deficits—say, 
an IQ in the range of moderate mental retardation 
or lower—are incompetent to stand trial. Likewise, 
a pattern of test results indicating severe neuropsy- 


chological deficit may warrant a finding of legal in- 
competence, even if the client’s IQ is in the normal 
range. For example, a defendant with severe stroke- 
induced deficits in language comprehension may 
be found incompetent to stand trial. 

Several formalized screening tests and proce- 
dures are available to assist in competency evaluation. 
These instruments are summarized briefly in Table 
10.12. We focus our attention here upon the Compe- 


TABLE 10.12 Competency Assessment 
Instruments 





Competency Screening Test 

(Lipsitt, Lelos, & McGarry, 1971) 
The CST is a 22-item sentence completion test with an 
objective scoring guide. Some reviewers find the scor- 
ing criteria to be vague and also complain that the test 
has poor face validity and weak content validity. 


Competency Assessment Instrument 

(Laboratory for Community Psychiatry, 1974) 
The CAI is a structured interview of 13 functions rele- 
vant to competent functioning at trial. One concern is 
that examiners use different approaches to administer 
the scale. 


Interdisciplinary Fitness Interview 

(Golding, Roesch, & Schreiber, 1984) 
The IFI is a semistructured interview of the defendant 
in 5 legal areas and 11 categories of psychopathology. 
Ideally, the defendant’s attorney should be present 
during administration. 


Fitness Interview Test 

(McDonald, Nussbaum, & Bagby, 1992) 
The FIT is a semistructured rating of the defendant re- 
garding 24 legal issues and 14 psychiatric areas. The 
FIT reveals excellent interrater reliability and discrimi- 
nates well between groups of defendants rated as being 
fit or unfit to stand trial. Pending its revision, caution
is advised in the use of the scale.


Georgia Court Competency Test-Revised 

(Johnson & Mullett, 1987) 
The GCCT-R is a 21-item oral test of understanding 
courtroom procedure, knowledge of the charge, knowl- 
edge of possible penalties, and ability to communicate 
rationally. Some reviewers warn that the test inadequately 
samples the domain of competence-related abilities. 







tency Screening Test, which is probably the most 
widely used adjunct to competency evaluation. 
Lipsitt, Lelos, and McGarry (1971) developed 
the Competency Screening Test (CST) in an at- 
tempt to formalize the competency evaluation. The 


CST consists of 22 incomplete sentences that focus
on courtroom procedures, attorney-client relation- 
ships, and thought processes of the defendant 
(Figure 10.2). Each response is scored 0 (poor), 
1 (borderline but not clearly appropriate), or 








1. The lawyer told Bill that
2. When I go to court, the lawyer will
3. Jack felt that the judge
4. When Phil was accused of the crime, he
5. When I prepare to go to court with my lawyer
6. If the jury find me guilty, I
7. The way a court trial is decided
8. When the evidence in George’s case was presented to the jury
9. When the lawyer questioned his client in court, the client said
10. If Jack had to try his own case, he
11. Each time the D.A. asked me a question, I
12. While listening to the witnesses testify against me, I
13. When the witness testifying against Harry gave incorrect evidence, he
14. When Bob disagreed with his lawyer on this defense, he
15. When I was formally accused of the crime, I thought to myself
16. If Ed’s lawyer suggests that he plead guilty, he
17. What concerns Fred most about his lawyer
18. When they say a man is innocent until proven guilty
19. When I think of being sent to prison, I
20. When Phil thinks of what he is accused of, he
21. When the jury hears my case, they will
22. If I had a chance to speak to the judge, I


2. When I go to court the lawyer will

(a) Legal criteria: ability to cooperate in own defense, communicate, relate.
(b) Psychological criteria: ability to relate or trust

Score 2:
Examples: “defend me”
“be there to help me”
“do his best to get me off with a light sentence”
“represent me”
“present my case”

Score 1:
Examples: “be there”
“ask for a postponement”
“ask me to take the stand”

Score 0:
Examples: “put me away”
“keep his mouth shut”
“prosecute me”


FIGURE 10.2 Competency Screening Test

Source: Reprinted with permission from Lipsitt, P. D., & Lelos, D.
(1970). Competency screening test. Boston: Competency to Stand Trial
and Mental Illness Project.
2 (clearly appropriate) according to sample scoring
standards. In a cross-validation study based on 50 
male residents of a state forensic unit, the CST cor- 
rectly predicted 82 percent of the competency rec- 
ommendations rendered by the forensic team 
charged with determining legal competency or in- 
competency. The CST was found to gauge compe- 
tence accurately, but overpredicted incompetence 
(Nottingham & Mattson, 1981). In general, the 
CST and similar instruments are a useful beginning 
to a competency evaluation, but should not be the 
sole method of assessment. Additional competency 
screening tests are reviewed by Bagby, Nicholson, 
Rogers, and Nussbaum (1992), Melton et al. 
(1998), and Nicholson, Robertson, Johnson, and 
Jensen (1988). 
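
The scoring rule described above (22 completions, each rated 0, 1, or 2) amounts to a simple sum. The following Python sketch is illustrative only: the function name and the example ratings are hypothetical, and no cutoff score is applied because none is stated in this discussion.

```python
def cst_total(ratings):
    """Sum CST item ratings (0 = poor, 1 = borderline, 2 = clearly appropriate)."""
    if len(ratings) != 22:
        raise ValueError("the CST has 22 sentence-completion items")
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each item is rated 0, 1, or 2")
    return sum(ratings)


# Hypothetical protocol: 15 clearly appropriate, 4 borderline, 3 poor responses.
example = [2] * 15 + [1] * 4 + [0] * 3
print(cst_total(example))  # 34 of a possible 44
```

A total of this kind is only a screening index; as noted above, it should not be the sole basis for a competency opinion.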

A serious concern in competency evaluations is 
whether the client is malingering. After all, delay- 
ing a trial date for a long time provides a strong mo- 
tive to appear incompetent. Clinicians have a 
variety of methods and tests (described previously) 
for identifying clients who might be malingering. 
Even so, the process of competency evaluation is 
not foolproof, as indicated by such high-profile 
cases as the Connecticut man who avoided prose- 
cution for murder (Associated Press, June 30, 


1998). This individual had allegedly murdered his


former girlfriend and her current boyfriend with a 
handgun, then shot himself in the head. He suffered 
brain damage and partial paralysis and was de- 
clared incompetent to stand trial by four psychia- 
trists in four separate hearings. They argued that he 
was incapable of communicating effectively with 
his lawyer. A court order that he undergo yearly 
competency evaluations was overturned, dropping 
him through the cracks and leaving him a free man. 
Nine years later he was found attending college as 
a pre-med student with a 3.3 grade point average. 
Examples like this are reason for humility and cau- 
tion when psychologists approach competency 
evaluations.¹


1. Another (cynical) explanation is that even incompetent per- 
sons with severe brain damage will receive As and Bs when they 
attend college. Perhaps grade inflation has gone too far. 





PREDICTION OF VIOLENCE
AND ASSESSMENT OF RISK


Psychologists occasionally serve as consultants 
during the sentencing phase of a trial to determine 
whether a convicted defendant poses a danger to 
fellow prisoners or to community members. This 
information may be pivotal in determining the 
length of a sentence or the placement of the defen- 
dant (e.g., in minimum-, medium-, or maximum- 
security facilities). Parole decisions also may rest 
upon a determination of dangerousness. In extreme 
cases, courtroom testimony of mental health pro- 
fessionals may influence the decision whether to 
invoke the death penalty. 

Unfortunately, even though many researchers 
have attempted to devise psychological measures 
of dangerousness, no psychometric test or proce- 
dure has been developed that provides an accurate 
long-range prediction of violence (Blau, 1984; 
Horowitz & Willging, 1984; Melton et al., 1998). 
In fact, it is not even possible to postdict violence 
from psychological tests alone with any degree of 
accuracy. 

Why is it so difficult to predict violence accu- 
rately? Melton et al. (1998) cite four factors that 
contribute to the challenge: 


1. Variability in the legal definition. The outcome 
to be predicted is variously defined as violence, 
risk, or dangerousness, and the criteria of legal 
relevance shift from one setting to another. 
Some statutes require proof of an overt act 
within a particular time period, whereas others 
might accept a clinical judgment of “explosive 
tendencies” as proof of violence. 

2. Complexity of the literature. The research liter- 
ature that examines the relationship between 
background factors and violence recidivism is 
enormous. However, the voluminous literature 
is “both overwhelming and disjointed” so that 
even conscientious professionals find it difficult 
to extract meaning from it. 

3. Judgment errors and biases. Mental health pro- 
fessionals are prone to subtle cognitive errors in 
their evaluation of defendants. For example, one 
problem is the tendency to view dangerousness
solely as a trait when, in fact, its appearance is 
always an interaction with environmental fac- 
tors (e.g., the availability of a weapon). Another 
problem is that clinicians let the salience of a 
crime affect their judgment disproportionately. 
4. Political consequences for the practitioner. 
Overpredicting violence is safer for clinicians 
than underpredicting. If a defendant is released 
on the basis of a clinician’s prediction and then 
commits a violent crime, the practitioner is vul- 
nerable to legal action for negligent release. In 
borderline cases, there is a very strong incentive 
to lean in the direction of a “dangerous” finding. 


A recent study by Menzies, Webster, McMain, 
Staley, and Scaglione (1994) illustrates the diffi- 
culties in predicting violence. These researchers 
used a 15-item Dangerous Behavior Rating Scheme 
(DBRS) to predict violence over a six-year follow- 
up of 162 accused persons remanded for evalua- 
tion of dangerousness. The DBRS consisted of 15 
7-point Likert scale items measuring personality at- 
tributes (e.g., hostility, emotionality, capacity for 
empathy), environmental factors (e.g., support, 
stress), situational factors (e.g., dangerousness in- 
creased with alcohol), and general items (e.g., dan- 
gerous to others in the present). In addition, a variety 
of prediction indices based upon ratings by nurses, 
psychiatrists, social workers, and others were used 
to predict an assortment of violence outcome mea- 
sures at intervals ranging from one year to six years. 
Most of the correlations were on the order of .2, and 
very few predictors managed to exceed an upper 
limit of .40. The highest correlation was .53, but this 
pertained to the restricted case of short-term prediction
of violence in those hospitalized on a psychiatric
ward. Overall, the results were so discouraging
that the researchers advised against “clinical or psy- 
chometric involvement in the identification of po- 
tentially violent clinical or correctional subjects.” 

A particular concern in the prediction of vio- 
lence is the very high false-positive rate usually ob- 
served in controlled studies. This means that large 
proportions of those predicted to be violent subse- 
quently reveal no violent behaviors. Gardner, Lidz, 


Mulvey, and Shaw (1996) asked clinicians in a psychiatric
emergency room to rate their concern (from
0, no concern, to 5, great concern) that 784 patients 
would engage in violence over the following six 
months. The ratings of the clinician and the attend- 
ing psychiatrist were combined (a total of 10 points 
possible) to produce a more stable estimate. A total 
of 327 patients received total scores of 6 or greater, 
indicating a significant degree of clinical concern. 
Of this group, 49 percent were found to be com- 
pletely free of violent behavior over the next six 
months! A simple statistical formula using back- 
ground information was significantly more accurate 
in making predictions, but still yielded a 42 percent 
false-positive rate. Both methods—the clinical and 
the statistical—also yielded high levels of false neg- 
atives (patients not suspected of future violence who 
subsequently engaged in such acts). 
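
The arithmetic behind these figures can be made concrete. The sketch below, with hypothetical function names, combines two 0-to-5 concern ratings, applies the cutoff of 6 described above, and converts the reported 49 percent into an approximate head count among the 327 flagged patients. It assumes, as the text does, that the false-positive rate is the share of patients predicted to be violent who showed no violence.

```python
def combined_concern(clinician, psychiatrist):
    """Add two 0-5 concern ratings to give a 0-10 total."""
    for rating in (clinician, psychiatrist):
        if not 0 <= rating <= 5:
            raise ValueError("ratings range from 0 to 5")
    return clinician + psychiatrist


def flagged(total, cutoff=6):
    """True when the combined rating signals significant clinical concern."""
    return total >= cutoff


def nonviolent_count(n_flagged, nonviolent_share):
    """Approximate number of flagged patients who showed no violence."""
    return round(n_flagged * nonviolent_share)


print(flagged(combined_concern(3, 4)))  # True
print(nonviolent_count(327, 0.49))      # about 160 of the 327 flagged patients
```

In other words, roughly 160 of the patients who aroused significant concern were never violent during the follow-up period.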

Although the long-range prediction of danger- 
ousness with psychological tests or rating scales 
has proved to be difficult, the short-range assess- 
ment of risk with violent offenders has met with 
moderate success. Analyzing the prior history of vi- 
olent offenders, Hall (1987) demonstrated that the 
following variables can be used to derive a probability
of violence in the three months following a
forensic evaluation:


• Recency of prior violent act(s)
• Number of previous serious acts of violence
• Substance abuse within the previous month
• Actual or threatened breakup of love relationship
• Work problems leading to discipline or termination
• Opportunity for violence, such as access to a handgun


By quantifying the preceding factors, the examiner 
arrives at a probability of continued violence in the 
short-range future (Hall, 1987). This strategy is use- 
ful mainly with examinees who have a past history of 
violence, so it has limited applicability in the foren- 
sic field. However, for particular persons who meet 
most or all of the previously listed criteria, a predic- 
tion of violence can be correct more often than not. 
Quinsey, Harris, Rice, and Cormier (1998) 
reported similar success in appraising the risk of 




violent recidivism in more than 600 convicted 
offenders. Based upon the empirical literature, they 
constructed the Violence Risk Appraisal Guide 
(VRAG), an objective rating scale defined by the 
following twelve offender criteria: 


1. Raised by nonbiological parents 

2. Degree of elementary school maladjustment

3. History of alcohol problems 

4. Marital status of never married 

5. Criminal history score (separate guide) 

6. Failure on prior conditional release 

7. Early age at first major offense 

8. Injury to victim at first major offense 

9. Female victim(s) at first major offense 
10. Diagnostic criteria met for personality disorder 
11. Diagnostic criteria not met for schizophrenia 
12. Psychopathy checklist score (separate guide) 


Weighted ratings in the range of -5 to +5 were assigned
for each criterion, with a maximum possible
score of 38 points. For offenders who obtained very 
high scores (28 points or more), the probability of 
violent recidivism in the 10-year follow-up was 
1.00. In other words, for this small group of of- 
fenders, the psychologists could predict with cer- 
tainty that a violent criminal act would occur when 
these individuals were released to the community. 
For the next highest group of offenders, those who 
obtained scores of 21 to 27 points, the probability 
was .82. In other words, in this group, four out of 
five individuals committed a violent offense after 
release. In general, the probability of violent re- 
cidivism was a direct linear function of VRAG 
score. At the other end of the spectrum, offenders 
who received the lowest VRAG scores virtually 
never committed a subsequent violent offense. The 
approaches highlighted here demonstrate once 
again the potency of biodata in predicting behavior. 
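
The VRAG scoring logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published scoring procedure: the item weights in the example are hypothetical, and only the two score bands whose recidivism probabilities are reported here are encoded. All other totals return None.

```python
def vrag_total(weights):
    """Sum twelve weighted criterion ratings, each between -5 and +5."""
    if len(weights) != 12:
        raise ValueError("the VRAG is defined by twelve criteria")
    if any(not -5 <= w <= 5 for w in weights):
        raise ValueError("each weighted rating lies between -5 and +5")
    return sum(weights)


def reported_probability(total):
    """Probability of violent recidivism for the two bands cited in the text."""
    if total >= 28:
        return 1.00
    if 21 <= total <= 27:
        return 0.82
    return None  # probabilities for lower bands are not given here


# Hypothetical weighted ratings for one offender.
example = [3, 2, 2, 1, 3, 3, 2, 1, 1, 3, 1, 4]
total = vrag_total(example)
print(total, reported_probability(total))  # 26 0.82
```

The lookup is deliberately incomplete: filling in the remaining bands would require the probabilities from the original VRAG tables.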


EVALUATION OF CHILD 
CUSTODY IN DIVORCE 


Psychologists may testify in the child custody dis- 
putes that arise after divorce. Actually, these dis- 
putes are rare—in 90 percent of divorce cases, both 


parents agree upon custody arrangements without
resort to legal intervention (Melton et al., 1998). 
The role of the consultant is to offer expert opinions 
on such matters as the suitability of the parents or 
the best interests of the child. These opinions usu- 
ally are based upon an assessment of the child (or 
children) and both parents and may include psy- 
chological testing. 

Unfortunately, testimony based upon psycho- 
logical test results rarely provides a useful basis for 
helping a judge make the Solomonic decision re- 
quired in a custody dispute (Blau, 1984). The es- 
sential weakness of psychological tests in this 
regard is that the link between test findings and ef- 
fective parenting is weak or nonexistent. The 
Rorschach technique and other personality tests 
were not designed to assess parents’ relationships 
to children and are therefore largely irrelevant to 
the real issues in child custody cases. 

Unfortunately, clinicians have been slow to re- 
alize their forensic limitations in child custody dis- 
putes. As a result there “is probably no forensic 
question on which overreaching by mental health 
professionals has been so common and so egre- 
gious” (Melton et al., 1998, p. 484). Understandably, 
legal practitioners are skeptical about the value of 
psychological testimony in child custody cases. 

The American Psychological Association has 
acknowledged the tension between psychology and 
the legal system by publishing guidelines for child 
custody evaluations in divorce proceedings (APA, 
1994a). These guidelines refer to concerns about 
the “misuse of psychologists’ influence” in custody 
proceedings and offer principles of practice for this 
increasingly complex area. The guidelines specify 
that the best interest of the child is paramount. They 
further specify that specialized training is required, 
including familiarity with applicable legal stan- 
dards and procedures, and knowledge of laws gov- 
erning divorce and custody adjudication in the local 
jurisdiction. The guidelines also prescribe a com- 
plete neutrality on the part of the psychologist: 

The psychologist, in a balanced, impartial manner,
informs and advises the court and the prospective
custodians of the child of the relevant psychological
factors pertaining to the custody issue. The psychologist
should be impartial regardless of whether
he or she is retained by the court or by a party to
the proceedings. (APA, 1994, p. 678)


This last guideline is especially important for the 
profession insofar as it promotes dignity in the 
practice of psychology. Nonetheless, impartiality is 
a difficult quality to maintain insofar as subtle pres- 
sures arise when a partisan attorney retains a psy- 
chologist for custody evaluation. 

In response to the absence of a scientific data 
base, a number of experimental assessment devices
have been proposed for child custody forensic pro- 
cedures, including rating scales that attempt to 
gauge the potential for effective child rearing. One 
such test is the Ackerman-Schoendorf Scales for 
Parent Evaluation of Custody (ASPECT; Acker- 
man & Schoendorf, 1992). The ASPECT is essen- 
tially a battery of standard tests (Rorschach, 
MMPI-2, WAIS-R, WRAT-R for the parents, and 
Draw-A-Family, CAT, and an IQ measure for the 
child) used alongside open-ended questionnaires, 
interviews, observations, and court records (where 
necessary) as a database from which the practi- 
tioner completes a 56-item yes-no inventory (sep- 
arate inventories are completed for the father and 
the mother). Results from the 56 yes-no items yield 
three subscores and a total score, the Parental Cus- 
tody Index (PCI) for each parent. All scores are re- 
ported as T scores with a mean of 50 and standard 
deviation of 10. The PCI is intended to identify 
which parent is more effective—and how much 
more. A T-score difference of 10 points or more on 
the PCI is interpreted as suggesting that the higher- 
scoring parent might be a better choice for custody. 
In general, T-score differences of 10 to 15 points 
are considered significant, differences of 16 to 20 
points are very significant, and differences of more 
than 20 points are marked. When both parents have 
PCI scores above 60, it is thought likely that either 
would be an effective parent. If neither parent is ef- 
fective (scores of 40 or below), the PCI is intended 
to reflect this finding as well. 
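
The interpretive rules for the PCI lend themselves to a short worked example. The Python sketch below is a hypothetical illustration: the function names are invented, the T-score conversion is the standard one (mean 50, standard deviation 10), and differences are rounded to whole points on the assumption that the published bands refer to whole-number T scores.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, standard deviation 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


def interpret_pci(pci_mother, pci_father):
    """Classify a PCI difference using the bands described in the text."""
    diff = round(abs(pci_mother - pci_father))
    if diff > 20:
        label = "marked"
    elif diff >= 16:
        label = "very significant"
    elif diff >= 10:
        label = "significant"
    else:
        label = "not significant"
    return {
        "difference": diff,
        "label": label,
        "both_effective": pci_mother > 60 and pci_father > 60,
        "neither_effective": pci_mother <= 40 and pci_father <= 40,
    }


print(t_score(30, 25, 5))               # 60.0
print(interpret_pci(62, 45)["label"])   # very significant
```

Note that the difference bands and the absolute cutoffs answer separate questions: two parents can both score above 60 and still differ by 10 points or more.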

For the total PCI score, interrater reliability 
ranges from .92 to .96 and is considered adequate. 


Reliability data for the three subscales consist- 
ing of 


• Observational, or the parent’s appearance and presentation
• Social, or the parent’s interaction with others, including the child
• Cognitive-Emotional, or the parent’s psychological and mental functioning


are substantially weaker, indicating that practition- 
ers should rely upon the PCI score only. A large 
standardization sample was used (N = 200), but 
these families were predominantly white (97 per- 
cent), so that the ASPECT may not be appropriate 
for use with minority families. 

Validity of the PCI score was assessed by com- 
paring ASPECT recommendations with judges’ de- 
cisions in custody cases. In those cases in which 
there was a significant difference between the 
ASPECT scores of the mother and the father (10 
points or more), the test showed 93 percent agree- 
ment with the judges’ decisions. Even so, review- 
ers have recommended caution in the use of the 
ASPECT for several reasons. One concern is the 
dearth of independent research in refereed journals 
pertinent to the validity of the instrument. Melton 
(1995) notes that the validation study was based 
upon the same families as those used to develop the 
instrument, which could lead to inflated hit rates. 
Another concern is that some scale items (e.g., 
whether the parent’s IQ is more than five points 
below that of the child; which parent the child is 
placed next to in the child’s drawing) have not been 
shown to indicate parental competence or child out- 
come in divorce studies. Heinze and Grisso (1996) 
conclude that the ASPECT needs more normative, 
reliability, and validity data before practitioners can 
place confidence in the assessment battery. 

One problem with the ASPECT is the sheer size
of the entire battery, which includes lengthy tests 
with both parents and all children, as well as inter- 
views and analysis of abundant (excessive?) test 
information. Perhaps the same kind of assessment 
information could be obtained in a manner less 
time-consuming. This is the approach taken by 
Gordon and Peek (1989), who developed the Custody
Quotient™, a sophisticated item-based rating
scale that yields a nine-point rating for each of the 
following parenting factors: 


Emotional needs of child now and in future 
Physical needs of the child now and in future 
No dangers to the child emotionally or physically 
Good parenting 

Parent assistance 

Planning for the child 

Home stability 

Prior caring 

Acts or omissions 

Values 


The instrument also yields an overall custody quo- 
tient, or CQ (mean of 100), designed to gauge 
parental competence. The authors report some em- 
pirical data about their instrument, including inter- 
rater agreement studies based on evaluation of 
videotaped interviews. For most scales, the two au- 
thors were within two points of each other (on the 
nine-point rating scales) 90 to 100 percent of the 
time. The validity of the instrument has been as- 
sessed, in part, by confirmatory factor analysis 
which reveals a “general good parenting factor” as 
predicted. An initial validation study revealed that 
parents with a CQ less than 100 rarely received cus- 
tody of the child (except in cases where both parents 
received ratings below 100). Gordon and Peek 
(1989) acknowledge that their instrument needs ad- 
ditional research. Bischoff (1992) reminds potential 
users that the CQ is a research instrument that should 
not be a major basis for custody determinations. 

Another test for child custody evaluation is the 
Parent-Child Relationship Inventory (PCRI; Gerard, 
1993). The PCRI is a 78-item self-report inventory 
suitable for mothers and fathers of children ages 3 to 
15 years. The inventory includes seven scales: 


• Parental Support
• Satisfaction
• Involvement
• Communication
• Limit Setting
• Autonomy
• Role Orientation


Norms are based on a nonclinical sample of more 
than 1,000 parents throughout the United States. 
As with most instruments relevant to child custody
evaluation, more research is needed to determine the
value of the PCRI in this setting. One
special concern is that the test items are relatively 
transparent such that parents might consciously 
skew the test findings in a favorable direction. 

Bricklin (1995) has devised four simple tests 
that may prove helpful in custody evaluation. Two 
of the tests assess skill and investment in parenting 
for the mother and father separately. Specifically, 
the Parent Awareness Skills Survey evaluates aware- 
ness of child-rearing skills, whereas the Parent Per- 
ception of Child Profile appraises a genuine interest 
in the child. We focus here on the remaining two in- 
struments, which examine the child’s perception of 
the parents: the Bricklin Perceptual Scales (BPS) 
and the Perception-of-Relationships Test (PORT). 

The Bricklin Perceptual Scales, or BPS, con- 
sists of 64 questions asked of each child (age four 
and above) involved in a custody dispute. Thirty- 
two questions pertain to the child’s perception of 
mother and 32 to perceptions of father. The ques- 
tions fall into four areas designed to identify the 
best interests of the child: competency, supportive- 
ness, consistency, and possession of admirable 
traits. Examples of the kinds of questions asked are 
as follows: 


Competency: “If you were having trouble with 
a school report, how much would Mom be able 
to help you?” 

Supportiveness: “When you feel really upset, 
how much does Dad help you calm down?” 
Consistency: “If you tried to sleep in on a school 
day, how often would Mom make you get up at 
your regular time?” 

Possession of Admirable Traits: “If you had a 
pet, how well would Dad take care of it when you 
went to camp for a week?” 


The 32 Mom questions and 32 Dad questions are 
identical; they are widely separated in the ques- 
tionnaire so as to avoid immediate and obvious 
comparisons. The questions are asked verbally first 
and then repeated with a slightly different phrasing.
For example, the examiner holds a test card up
to the child: “If this [point to end of line marked
‘Very Well’] is Dad doing very well at helping you
to calm down, and this [pointing to end marked
‘Not So Well’] is Dad doing not so well at helping
you to calm down, where on this line would Dad
be?” The child responds by using a stylus to punch
a hole in the card. Each card contains a scoring grid
on the opposite side for immediate and objective 
scoring. 

BPS scoring consists of an item-by-item com- 
parison of Mom and Dad ratings. A parent-of- 
choice (POC) is identified in terms of which parent, 
the mother or father, obtains the higher score on the 
most items: “It is assumed that this parent is the parent
better able to operate in the child’s ‘best interests’
in the widest variety of situations as measured
by the BPS” (Bricklin, 1995, p. 77). Of course, this 
result is just one point of information used by the ex- 
aminer in arriving at a custody recommendation. 
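
The item-by-item comparison described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not Bricklin's published scoring procedure: the function name, the use of plain numeric ratings, and the handling of tied items (counted for neither parent) are assumptions made for the example.

```python
def parent_of_choice(mom_ratings, dad_ratings):
    """Return the parent rated higher on the most items, or None on a tie.

    Assumption for illustration: each list holds one numeric rating per
    item, in the same item order, and a tied item favors neither parent.
    """
    if len(mom_ratings) != len(dad_ratings):
        raise ValueError("Both parents must be rated on the same items.")

    # Count the items on which each parent obtains the higher rating.
    mom_wins = sum(m > d for m, d in zip(mom_ratings, dad_ratings))
    dad_wins = sum(d > m for m, d in zip(mom_ratings, dad_ratings))

    if mom_wins > dad_wins:
        return "mother"
    if dad_wins > mom_wins:
        return "father"
    return None  # no parent-of-choice can be identified from these items


# Hypothetical ratings on three items (made-up numbers, not BPS data).
example = parent_of_choice([5, 5, 1], [4, 4, 5])
```

In this made-up example the mother is rated higher on two of the three items, so she is the parent-of-choice even though the father's ratings sum to a larger total. The rule counts items won; it does not add up points.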

The BPS embodies a number of positive quali- 
ties in comparison to other approaches to custody 
evaluation. First, the test avoids asking the child to 
make a direct choice between the parents, which 
minimizes guilt. It also assesses important areas of 
caretaking presumed to be important for healthy 
child development. Furthermore, by focusing on 
the child’s perceptions of parenting as opposed to 
the examiner’s perceptions of parenting, the instru- 
ment provides a voice for those with the greatest 
stake in a custody decision. Bricklin (1995) notes 
that the BPS is not necessarily intended to be a di- 
rect measure of parental competency but instead is 
designed to identify the caretaker whose style is 
congruent with the child’s ability to take in and 
profit from parental guidance. 

Although evidence regarding the reliability of 
the BPS is not available in Bricklin (1995), initial 
validity findings are supportive. Specifically, the 
BPS yielded a 90 percent agreement rate with the 
caretaker choices of mental health professionals 
who had access to independent clinical and family 
history data collected over a period of several years. 
On a cautionary note, as is true of so many instru- 
ments in the field of custody evaluation, there are 
few (if any) empirical studies of the BPS published
in refereed journals. Heinze and Grisso (1996) offer
the following critique: 


A related issue is that the Bricklin Scales attempt 
to address the legal questions of preferred custody 
without direct assessment of parents’ functional 
abilities and deficits, including how these interact 
with children’s needs. In addition, the scales 
address the issue of two parents with significant 
deficits only in a very superficial way. Thus, 
because the child’s perceptions of the two parents 
are compared only to each other and not those
of other parents, the clinician may not glean
much information regarding a specific parent’s
level of parental functioning as perceived by
the child. (p. 301)


Clearly, more research from independent sources 
would be desirable. 

Another child-based test invented by Bricklin 
(1995) is the Perception-of-Relationships Test 
(PORT). The PORT is a projective drawing test that 
can be scored to determine the parent-of-choice. 
The child is asked to complete seven sets of draw- 
ings, beginning with a drawing of each parent. The 
test then proceeds to other drawings, including a 
self-representation on the same sheet as a drawing 
of Mom, and then of Dad. A complex scoring sys- 
tem is used to identify the parent with whom the 
child seeks psychological “closeness.” This is 
thought to reveal which parent the child views as 
more supportive and better able to respond to the 
child’s needs. In a series of seven validity studies 
spanning several decades (Bricklin, 1992), the 
PORT typically yielded a 90 percent accuracy in 
predicting the parent-of-choice as designated by a 
courtroom judge or independent clinicians. 

One specialized application of the PORT men- 
tioned by Bricklin (1995) is the detection of phys- 
ical and sexual abuse. The scoring criteria that 
indicate abuse appear to have been derived from 
psychoanalytic lore. For example, all of the fol- 
lowing are thought to signal abuse: 


• A dramatic increase in the distance between the
self-figure and the abusing parent in comparison
to the nonabusing parent

• Wavy or broken lines in the area of bodily abuse
(e.g., genital area, breasts)

• Distortion of one hand in the drawing of an abusing
parent


Unfortunately, empirical support for these hy- 
potheses is limited, as we discuss at length in a later 
topic (Topic 13B, Projective Techniques). Because 
allegations of sexual abuse are highly prejudicial in 
custody proceedings, examiners should exercise 
great caution in speculating about such matters, es- 
pecially if the evidence consists solely of a few 
human figure drawings deemed to be suspicious.

In closing this section on custody evaluation,
we remind the reader that expert testimony in child 
custody evaluations needs to be tempered with a 
large dose of humility on the part of the psycho- 
logical examiner. Determining the best interests of 
the child is almost always a difficult and thorny as- 
signment. The absence of well-validated tests and 
methods for this task assures that custody evalua- 
tion will remain one of the most challenging re- 
sponsibilities in all of psychological practice. 








PERSONAL INJURY AND RELATED TESTIMONY

Personal injury as from an automobile accident is 
often a source of litigation for monetary compen- 
sation. In personal injury lawsuits, attorneys may 
hire psychologists to testify as to the lifelong con- 
sequences of traumatic stress or acquired brain 
damage. For example, a clinical neuropsychologist 
might administer a comprehensive test battery (see 
Chapter 9, Neuropsychological and Geriatric As- 
sessment) and then testify as to the long-term func- 
tional implications of known brain damage. 

In general, a consulting psychologist who tes- 
tifies in court will encounter extremely high prac- 
tice standards. We have already mentioned the Frye 
standard, which provides that testimony must be 
based upon tests and procedures that have “gained 
general acceptance” in the field. Thus, a test or pro- 
cedure that is relevant or useful in everyday clini- 
cal practice—but which is not widely accepted in 
the field—might be greeted with skepticism in the 
courts. A judge may even rule that testimony is in- 
admissible if it is based upon tests or procedures 
with flimsy validation. Worse yet, the judge may
allow such testimony, which opens the expert witness
to criticism and ridicule by opposing attorneys.
With these concerns in mind, Heilbrun (1992)
has published guidelines for the practice of foren- 
sic assessment, which we summarize here: 


• Tests should be commercially available and well
documented, such as in Mental Measurements
Yearbook.

• Reliability should be well documented and .8 or
higher, except in unusual circumstances.

• Tests should be relevant to the legal issue, or to a
psychological construct underlying the legal
issue.

• Standard administration should be used, with
ideal testing conditions.

• Tests should have been validated in a population
that is relevant to the individual being assessed.

• Where possible, practitioners should use objective
tests with actuarial formulas for interpretation.

• Practitioners should check carefully for malingering,
defensiveness, and other reasons to discount
or ignore the test data.


Additional guidelines can be found in the “Spe- 
cialty Guidelines for Forensic Psychologists” 
(Committee on Ethical Guidelines for Forensic 
Psychologists, 1991). 

Increasingly, courts have been willing to com- 
pensate mental injuries in addition to physical in- 
juries. The damage is variously referred to as 
“psychic trauma” or “emotional distress” or “emo- 
tional harm.” The evaluation of emotional injury 
will rely somewhat on psychological test results 
(especially personality tests), but the assessment re- 
quires great clinical skill including “a longitudinal 
history of the impairment, its treatment, and at- 
tempts at rehabilitation, including the claimant’s 
motivation to recover” (Melton et al., 1998, p. 381). 
We see once again that the question of malingering 
haunts most forms of forensic assessment. 


INTERPRETATION OF POLYGRAPH RECORDS


Although polygraph results are not routinely admitted
in court cases, about 20 states allow testimony
if both parties in a trial agree that the evidence
can be used. At least two states (Massachusetts
and New Mexico) allow testimony even if the
other side objects (Wrightsman et al., 2002). In 
cases involving polygraph records, psychologists 
may be asked to offer expert testimony. We provide 
here a brief review of the art and science of poly- 
graph examination. 

The polygraph was developed around 1917 by 
William Marston, a student of Hugo Munsterberg 
at Harvard University. Marston’s crude approach 
consisted of measuring systolic blood pressure dur- 
ing questioning of a suspect. He believed that sys- 
tolic blood pressure would rise gradually if the 
examinee attempted to deceive the examiner. His 
controversial technique was the procedure ques- 
tioned in the landmark Frye v. United States (1923) 
case. The reader will recall that the Frye decision 
helped define the qualifications of an expert wit- 
ness as one who uses an approach which has 
“gained general acceptance in the particular field in 
which it belongs.” In the Frye case, the court did 
not allow Marston’s testimony because his “lie de- 
tector” approach was not widely accepted. 

The modern polygraph is substantially more 
sophisticated than Marston’s crude predecessor. In 
a modern polygraph test, the examinee has 
several monitors attached to his or her body: a blood 
pressure cuff, a heart rate monitor, a flexible ring 
around the chest to monitor breathing, and electrical 
leads to the fingers to detect changes in electro- 
dermal activity. The polygraph monitors ongoing 
physiological responses, including changes in 
breathing, pulse rate, blood pressure, and perspira- 
tion. The changes in perspiration are not monitored 
directly, but are measured indirectly from a pair of 
electrodes that gauge changes in electrical conduc- 
tivity on the surface of the skin. Even a very slight 
increase in moisture from perspiration facilitates 
conductivity of an electrical current across the sur- 
face of the skin. Polygraph translates literally as 
“many writings.” The physiological responses are 
displayed as a set of fluctuating ink lines drawn on 
a continuously moving roll of paper. 

The justification for using the polygraph as a lie 
detector derives from the observation that many
persons do react with increased physiological
arousal at the moment they tell a lie. In theory at 
least, truthful responses are accompanied by rela- 
tively flat ink lines, whereas a lie is presumed to 
cause significant, detectable fluctuations in heart 
rate, perspiration, and perhaps other measures as 
well. Thus, a common procedure in polygraph test- 
ing is to compare responses to a neutral or control 
question (“Is today Tuesday?”) with responses to a
relevant question (“Did you rob the First Interstate 
Bank last Friday?”). 

Although a polygraph is commonly referred to 
as a “lie detector,” this colloquial designation is
substantially inaccurate. Polygraphers themselves
are largely to blame for promoting the informal appellation
of “lie detector” for their instrument. As a
consequence, a vast mythology now surrounds the 
polygraph. In fact, a polygraph detects physiologi- 
cal responses, not lies. A distinctive physiological 
response monitored from a polygraph may or may 
not signal a lie—this is an empirical question open 
to research. 

The polygraph does not “beep” at the moment 
of a presumed lie; the pattern of physiological re- 
sponses must be interpreted by an examiner. Herein 
resides a significant limitation of the instrument: A 
great degree of judgmental interpretation is re- 
quired to determine which “blips” are significant 
and which are merely coincidental. Unfortunately, 
the “expert” judgments of experienced polygra- 
phers do not stand up well to empirical tests in the 
real world (Carroll, 1988; Lykken, 1981, 1987). In 
a field study of the accuracy of polygraph interpre- 
tation, Kleinmuntz and Szucko (1984) painted what 
is perhaps the most unflattering portrait of polyg- 
rapher accuracy. The authors collected polygraph 
readings from 50 persons later substantiated to be 
thieves and 50 persons suspected of the same thefts 
who were later exonerated. In addition, 20 unveri- 
fied cases were added as “buffer” or “filler” cases, 
but not included in the analyses. 

The polygraph examinations were presented to 
six highly experienced polygraphers. These inter- 
preters were told that half the sample was guilty 
and half innocent. The results of the study consisted 
of a tally of accurate and inaccurate identifications 
on the part of the interpreters. The reader will recognize
that the interpreters would be correct 50 percent
of the time merely by guessing. Also, we should point out
that several categories of classification were possi- 
ble: valid positives (guilty persons identified as 
guilty), valid negatives (innocent persons identified 
as innocent), false positives (innocent persons iden- 
tified as guilty), and false negatives (guilty persons 
identified as innocent). Kleinmuntz and Szucko 
(1984) also converted the polygraph readings to 
digital data for purposes of discriminant function 
analysis. In a discriminant function analysis, purely 
objective statistical factors are used to classify sub- 
jects with as much accuracy as possible. 

The results of the study were unexpectedly 
discouraging. On average, the interpreters were 
accurate 69 percent of the time, only a 19 percent 
improvement over guessing. The discriminant 
function analysis performed slightly better at 73 
percent. The single best interpreter was correct only 
76 percent of the time. The error rates in this care- 
ful study were alarmingly high, especially for false 
positives—innocent persons identified as guilty. 
Even the best polygraph interpreter still classified 
18 percent, or 9 of the 50, innocent persons as 
guilty. 
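
The four classification categories and the error rates quoted above can be tied together with a short Python sketch. The figures of 9 false positives among 50 innocent persons and 76 percent overall accuracy come from the text. The remaining cell counts (41 valid negatives, 35 valid positives, 15 false negatives) are not reported in the text; they are derived here on the assumption that the best interpreter gave a guilty-or-innocent call for every one of the 100 verified cases. The function and variable names are illustrative only.

```python
def classification_rates(valid_pos, false_neg, valid_neg, false_pos):
    """Summarize a two-by-two classification table as proportions."""
    guilty = valid_pos + false_neg      # persons who were actually guilty
    innocent = valid_neg + false_pos    # persons who were actually innocent
    return {
        "accuracy": (valid_pos + valid_neg) / (guilty + innocent),
        "false_positive_rate": false_pos / innocent,  # innocent called guilty
        "false_negative_rate": false_neg / guilty,    # guilty called innocent
    }


# Best interpreter: 9 of 50 innocent persons called guilty (from the text).
# Derived under the stated assumption: 76 correct calls in 100 cases, of
# which 50 - 9 = 41 are innocent, leaving 76 - 41 = 35 guilty persons
# correctly identified and 50 - 35 = 15 guilty persons missed.
best_interpreter = classification_rates(
    valid_pos=35, false_neg=15, valid_neg=41, false_pos=9
)
```

Under that assumption, the best interpreter's 76 percent accuracy would also imply that 15 of the 50 guilty persons were classified as innocent. A single accuracy figure conceals both kinds of error.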

Blinkhorn (1988) summarizes a psychometri- 
cally informed perspective on the polygraph as 
follows: 


The polygraph used as a lie detector falls very far 
short of acceptable standards for psychological tests. 
It is essentially unstandardized; it is internally in- 
consistent; rescoring of charts is unreliable; no retest 
reliability information is available for examinees; it 
produces a disproportionate number of false positive 
results; it has not been investigated for adverse im- 
pact on social and ethnic subgroups; it involves mea- 
sures sensitive to aspects of temperament. (p. 39) 


Other reviews are more favorable, but not so fa- 
vorable that legal experts will rush forward to pro- 
mote the polygraph as evidence in court. Saxe, 
Dougherty, and Cross (1985) reviewed 10 field 
studies of polygraph accuracy. In these studies, the 
average rate of correct classification of the inno- 
cent was 89 percent (range of 77 to 100 percent), 
whereas the average rate of correct classification of
the guilty was only 82 percent (range of 47 to 100
percent). More recently, a major field study of polygraph
testing by the U.S. Secret Service concluded
that examiners correctly identified 96 percent of the 
truthful subjects and 95 percent of the suspects who 
lied (Raskin, 1989). Elaad, Ginton, and Jungman 
(1992) also analyzed the accuracy of the polygraph 
in real-life criminal investigations. They amassed 
polygraph records for paired subjects suspected of 
the same crime (e.g., theft of a videorecorder from 
a military base). In all, they used records for 40 in- 
nocent subjects and 40 guilty persons for whom 
actual truth was later established by voluntary 
confession. Using objective decision rules, 94 per- 
cent of the innocent subjects were correctly clas- 
sified and 76 percent of the guilty subjects were 
correctly identified. Collectively, all of these stud- 
ies support the use of the polygraph in the investi- 
gatory stages of a criminal case. However, the error 
rates are still much too high for the polygraph to 
be allowed as evidence in the courtroom (Honts & 
Perry, 1992). 

In recent years, the polygraph appears to be 
making a limited comeback. For example, the state 
of Texas keeps tabs on sex offenders with polygraph 
tests, which pits civil liberties against the safety of 
the public (Associated Press, May 16, 1997). In one 
case, a 33-year-old man, who was on probation for 
molesting a young boy, “failed” a lie-detector test 
regarding sexual contact with children. When con- 
fronted with this, he confessed sexual activity with 
a 15-year-old boy and then faced a prison sentence 
for violation of his probation. Increasingly, court 
systems are relying upon the lie detector to keep 
track of sex offenders (50,000 in the state of Texas 
alone), raising difficult legal and ethical questions. 
One concern is the problem of false positives, in 
which an innocent person is suspected of sex crimes 
based upon the blips seen on the rolling ink lines of 
a polygraph test. In a survey of 195 psychologists 
from the Society for Psychophysiological Re- 
search, most respondents answered that poly- 
graphic lie detection is not theoretically sound, that 
the lie test can be beaten by easily learned counter- 
measures, and that test results should not be admit- 
ted in courts of law (Iacono & Lykken, 1997). 

CONTROVERSY OVER THE PSYCHOLOGIST AS EXPERT WITNESS


It should be evident from the previous topics that 
the expert testimony of psychologists may alter 
lives. As Faust and Ziskin (1988) observe: “De- 
pending on the expert’s opinion, an individual may 
be confined to a mental institution, receive huge 
monetary awards, obtain custody of a child, or lose 
his or her life.” These authors also note that clini- 
cians participate in at least 1 million legal cases an- 
nually, so the potential impact of expert testimony 
is far-reaching. It is fitting to conclude this section 
on legal and courtroom issues with brief mention of 
the controversy that surrounds the psychologist as 
expert witness. 

In a three-volume book titled Coping with 
Psychiatric and Psychological Testimony, Ziskin 
(1995) provides a completely skeptical view of the 
clinician in the courtroom. He summarizes his main 
points in a shorter article (Faust & Ziskin, 1988). 
The critique rests upon two assertions: 


1. Expert witnesses in psychology (and psychiatry) 
cannot answer forensic questions with reason- 
able accuracy. 

2. These experts do not help the judge and jury 
reach more accurate conclusions than would 
otherwise be possible. 


The first contention, that expert witnesses can- 
not answer forensic questions with reasonable ac- 
curacy, rests in large measure upon the alleged 
inability of clinicians to provide accurate psychi- 
atric diagnoses. The argument goes like this: If clin- 
icians cannot perform accurately an assignment so 
fundamental as the classification or diagnosis of 
patients, then how can they possibly perform the 
more complex task of determining the prior, cur- 
rent, or future state of the person under examina- 
tion? Thus, problems with diagnostic reliability are 
used to illustrate the more general difficulty in 
achieving interclinician agreement on descriptions 
or predictions of past, current, and future status of 
the examinee. To buttress their position, Ziskin and 
Faust (1988) cite numerous studies that demonstrate
the unreliability of psychiatric diagnosis,
including a few analyses that show the rate of
disagreement for specific diagnostic categories to
equal or exceed the rate of agreement. 

The second contention, that expert witnesses do 
not help the judge and jury reach more accurate 
conclusions than would otherwise be possible, is 
based upon assembled research findings that suggest
that professional clinicians do not in fact make
more accurate clinical judgments than laypersons
(Ziskin & Faust, 1988). For example, Faust and 
Ziskin (1988) cite a classic study by Goldberg 
(1959) in which office secretaries performed as well 
as professional psychologists in distinguishing the 
visual-motor productions of normal versus brain- 
damaged subjects on the Bender Visual Motor 
Gestalt Test, a design copying test. In the Goldberg 
(1959) study, half of the Bender Gestalt drawings 
were from patients with confirmed organic brain 
disease, whereas the other half were from normal 
individuals; chance diagnostic accuracy was there- 
fore 50 percent. For each subject’s drawings, 
participants in the study were asked to make a di- 
chotomous judgment: normal versus brain dam- 
aged. On average, Ph.D. psychology staff classified 
65 percent of the protocols correctly, psychology 
graduate students 70 percent, and office secretaries 
67 percent. Ziskin and Faust (1988) cite many other 
studies that illustrate the same general theme of no 
difference in the accuracy of clinical judgments for 
professional clinicians and laypersons. 

In defense of expert testimony, several authors 
have criticized the Faust and Ziskin (1988) critique 
for its partisan scholarship and lack of balance 
(Brodsky, 1989; Fowler & Matarazzo, 1988; Heil- 
brun, 1992; Matarazzo, 1990). For example, in his 
reviews of studies of diagnostic accuracy, Mata- 
razzo (1983, 1990) concluded that the findings in- 
dicate good to very good magnitudes of reliability; 
he cited many positive studies not mentioned by 
Faust and Ziskin (1988). Furthermore, Fowler and 
Matarazzo (1988) decry the lopsided emphasis 
upon diagnostic accuracy: 


Diagnostic classification alone is rarely, if ever, the 
basis on which legal determinations are made. 
Such legal questions as insanity, disability, and 
competency are based more on the judge’s or jury’s 
understanding of an individual’s behavior and ability
to function in specific situations than on what
specific diagnosis is assigned to that individual. 


We concur that Faust and Ziskin (1988) pro- 
duced a one-sided literature review that ignores the 
findings supportive of forensic psychology. For example,
consider one small detail of the Goldberg
(1959) study, a detail that, not by coincidence, Faust
and Ziskin (1988) failed to cite. In addition to using
psychologists, trainees, and secretaries as judges,
Goldberg (1959) also asked a renowned expert on
the Bender Gestalt test to classify the test protocols.
The expert achieved the best classification rates of
all: an impressive 83 percent correct.² On the basis
of a review of clinical judgment studies, Garb (1992) 
concludes that mental health professionals (and es- 
pecially psychologists) are more accurate in their 
judgments than laypersons; that is, they do have 
something to offer judges and juries in forensic 
cases. Clearly, there is no denying the fallibility of 
the individual clinician. But this point does not jus- 
tify the elimination of experts in the courtroom. 


SUMMARY 


1. The practice of psychology intersects with 
the legal system in several ways, including evalua- 
tion for malingering, assessment of mental state 
(for the insanity plea), assessment of competency 
to stand trial, prediction of violence, and child cus- 
tody evaluation. 

2. To qualify as an expert witness, a psychol- 
ogist must be a qualified expert in the opinion of the 
court. In general, testimony is restricted to tech- 
niques that have been available for a fairly long 
time in order to have a history of general accep- 
tance (Frye v. United States, 1923). 


3. In forensic assessments, the practitioner 
should have a high suspicion of malingering, but 
should reach this conclusion cautiously because of 
the consequences to the client. Tools such as the 
Structured Interview of Reported Symptoms can 
help in the assessment of malingering. 


4. The insanity defense is widely respected 
by jurisprudence experts and is invoked in fewer 
than 1 in 1,000 trials. Currently, most states follow 
the M’Naughten rule or the Model Penal Code as 
legal tests for insanity. Some jurisdictions allow 
“irresistible impulse” as a supplement. 


5. According to the M’Naughten rule, a per- 
son can be found not guilty by reason of insanity if 
“at the time of the committing of the act, the party 
accused was laboring under such a defect of rea- 
son, from disease of the mind, as not to know the 
nature and quality of the act he was doing; or, if he
did know it, that he did not know he was doing what
was wrong.” 


6. The Model Penal Code proposes that a 
“person is not responsible for criminal conduct if at 
the time of such conduct, as a result of mental dis- 
ease or defect, he lacks substantial capacity either 
to appreciate the criminality (wrongfulness) of his 
conduct or to conform his conduct to the require- 
ments of the law.” 


7. A recent addition to legal jurisprudence is 
the Guilty But Mentally Ill verdict. Use of this ver- 
dict has several liabilities, including confusion by 
jurors and harsher sentences than being found 
merely guilty. 


8. Rating scales such as the Rogers Criminal 
Responsibility Scales provide a helpful basis for 
evaluating the mental state of the defendant at the 
time of the offense (MSO). However, reviewers 
suggest caution in the application of this and simi- 
lar scales because they might lead to a false sense 
of scientific certainty about matters that are inher- 
ently intuitive and judgmental. 


9. Courts often ask psychologists to help determine
whether a defendant is competent to stand
trial. Competency generally implies that the defendant
is capable of understanding the charges
against him or her, can cooperate with the defense
attorney, and can make a reasonable self-presentation
during the trial.

Footnote 2. The expert was Max Hutt, who later authored The Hutt
Adaptation of the Bender Gestalt Test (1977).

10. The prediction of violence and assessment 
of risk is another court-based task that is occasion- 
ally asked of psychologists. Unfortunately, no psy- 
chometric test or procedure has been developed 
that provides an accurate long-range prediction of 
violence. In contrast, the short-range prediction of 
violence or risk has met with greater success. 


11. Additional areas of sensitive assessment 
include child custody disputes, in which psycho- 
logical test results rarely are helpful; and personal 
injury cases, in which psychological tests can play 
a crucial role in documenting the effects of injury. 


12. Several assessment devices and scales have 
been proposed to provide an objective means for
making child custody recommendations. While
providing useful information, these tools may 
promise more than they deliver from the standpoint 
of psychometric validity. 


13. A polygraph detects physiological re- 
sponses such as electrodermal changes to interro- 
gation questions and displays them as ink lines on 
paper. Research supports the use of the polygraph 
in the investigatory stages of a criminal case but 
error rates are still much too high for its use in the 
courtroom. 


14. Skeptics argue that expert witnesses in psy- 
chology (and psychiatry) cannot answer forensic 
questions with reasonable accuracy and do not help 
the judge and jury reach more accurate conclusions 
than would otherwise be possible. These opinions ap- 
pear to be based on one-sided reviews of the research. 


KEY TERMS AND CONCEPTS 


expert witness
malingering
mental state at the time of the offense (MSO)
not guilty by reason of insanity (NGRI)
M’Naughten rule
Durham rule
Model Penal Code
Guilty But Mentally Ill (GBMI)
competency to stand trial
custody evaluation
personal injury
polygraph


Industrial and organizational psychology (I/O
psychology) is the subspecialty of psychology 
that deals with behavior in work situations (Bor- 
man, Ilgen, Klimoski, & Weiner, 2003). In its 
broadest sense, I/O psychology includes diverse ap- 
plications in business, advertising, and the military. 
For example, corporations typically consult I/O 
psychologists to help design and evaluate hiring 
procedures; businesses may ask I/O psychologists 
to appraise the effectiveness of advertising; and military
leaders rely heavily upon I/O psychologists in
the testing and placement of recruits. Psychological
testing in the service of decision making about per- 
sonnel (hiring, placement, promotion, and evalua- 
tion) is thus a prominent focus of this profession. Of 
course, specialists in I/O psychology have broad 
skills and often handle many corporate responsibil- 
ities not previously mentioned. Nonetheless, there 
is no denying the centrality of assessment to their 
profession. 

The purpose of this chapter is to review the numerous
ways in which tests and assessment procedures
can be used by I/O psychologists for the previously
mentioned purposes. We will also introduce
the reader to some of the issues and
controversies that may arise in the selection and ap- 
praisal of employees. In Topic 11A, Personnel As- 
sessment and Selection, testing practices for 
employee selection are discussed and relevant in- 
struments are introduced. In Topic 11B, Appraisal 
of Work Performance, the vexatious problem of 
performance evaluation is addressed. Legal issues 
and the effects of government regulations upon in- 
dustrial assessment are also reviewed here. 

A FRAMEWORK FOR PERSONNEL ASSESSMENT AND SELECTION


Key Issues in Personnel Testing 














It is important to have a framework for under- 
standing the intricate, multifaceted role of the I/O 
psychologist in personnel assessment and selection.
A crucial point, often overlooked in the coverage
of tests and assessment approaches, is that
I/O psychologists must be more than mere specialists
in testing. This point is particularly important
in personnel selection, which requires so much 
more than knowledge of the tests that might be 
used. In order to use tests wisely in this capacity, 
the I/O psychologist must be conversant with sev- 
eral complex issues. For example, detailed knowl- 
edge of decision theory, discussed in Topic 4A, 
Basic Concepts of Validity, is essential for the effi- 
cacious use of tests in personnel selection. Ele- 
ments of decision theory such as the Taylor-Russell 
tables provide a basis for determining the expected 
proportion of successful applicants selected with
a test. An I/O psychologist would be foolish to em- 
bark on a new program of employee selection with- 
out considering Taylor-Russell tables and decision 
theory. In addition, the prudent consultant should 
be well grounded in the concepts of test bias, dis- 
cussed in Topic 7B, Test Bias and Other Contro- 
versies. For legal and ethical reasons, businesses 
disdain the use of biased tests. Typically, it is the 
psychologist’s responsibility to select or develop 
nonbiased tests for personnel selection. The I/O 


TABLE 11.1 A Compilation of Key Issues in 
Testing and Assessing for Personnel Selection 


Job analysis: What are the specific criteria for effective 
job performance? 

Tests and assessments: Do the selection devices and 
procedures possess a demonstrated relationship to ef- 
fective job performance? 

Cutoff scores: What is the expected proportion of 
successful applicants selected with a test of known 
validity? 

Cost effectiveness: Including the costs of testing and 
selection, which assessment procedures yield the great- 
est overall benefit? 

Test bias: Do the tests and assessment procedures evi- 
dence bias against one or more minorities? 

Legal guidelines: Do the tests and selection procedures
meet federal guidelines for fair employment testing?

Validity studies: What is the ongoing validity of the
personnel selection program?





psychologist should be familiar with issues of test 
bias, including the objective criteria by which tests 
are evaluated for bias. Likewise, the personnel spe- 
cialist must understand legal issues in employment 
testing, discussed at the end of this chapter. 
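
The expected proportion of successful applicants mentioned above can be approximated numerically. The sketch below is an illustration only, not a reproduction of the published Taylor-Russell tables: it assumes that test scores and job performance follow a bivariate normal distribution, and the function name expected_success_ratio and its parameter names are hypothetical labels chosen for this example.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution (mean 0, SD 1)


def expected_success_ratio(validity, base_rate, selection_ratio, steps=4000):
    """Approximate the proportion of selected applicants who succeed.

    validity        -- correlation between test and job performance
    base_rate       -- proportion of all applicants who would succeed
    selection_ratio -- proportion of applicants who are hired
    Assumes a bivariate normal model (an assumption of this sketch).
    """
    if not -1.0 < validity < 1.0:
        raise ValueError("validity must lie strictly between -1 and 1")
    x_cut = _STD.inv_cdf(1.0 - selection_ratio)  # test cutoff score (z)
    y_cut = _STD.inv_cdf(1.0 - base_rate)        # success threshold (z)
    resid_sd = math.sqrt(1.0 - validity ** 2)

    # Midpoint-rule integration of P(hired and successful) over the
    # test scores above the cutoff; the tail beyond z = 8 is negligible.
    upper = 8.0
    width = (upper - x_cut) / steps
    joint = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width
        p_success_given_x = 1.0 - _STD.cdf((y_cut - validity * x) / resid_sd)
        joint += _STD.pdf(x) * p_success_given_x * width
    return joint / selection_ratio


# Hypothetical case: validity .50, base rate .50, selection ratio .50
print(round(expected_success_ratio(0.50, 0.50, 0.50), 2))
```

With a validity of zero the function simply returns the base rate, which is the sense in which a test with no validity adds nothing to selection. Lowering the selection ratio while holding validity constant raises the expected proportion of successful hires.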

The many concerns of the I/O psychologist in 
personnel selection are profiled in Table 11.1, 
which provides a synopsis of major issues in test- 
ing and assessment. This list of key issues is not 
meant to be exhaustive, but it does convey the 
breadth of concerns encountered in the application 
of tests to personnel selection. The reader will
notice that one area of expertise needed by the I/O
psychologist is job analysis, which is the identifi- 
cation of criteria for effective job performance. A 
thorough job analysis can provide the basic build- 
ing blocks for personnel selection. For this reason, 
we discuss approaches to job analysis before turn- 
ing to other issues in personnel selection. 


Job Analysis 


For large corporations and small businesses alike, 
the basic building blocks of organizational success 
are the individual jobs performed by specific 
employees at all levels, from management on
down to the neophyte recruit. The manner in which
jobs are defined is crucial to organizational direc- 
tion and growth. Cascio (1987) has outlined the 
questions that may be asked when new or existing 
organizations are scrutinized from the standpoint of 
employee positions: 


How many positions will we have to staff and what 
will be the nature of these positions? What abili- 
ties, skills, and personal characteristics will be re- 
quired of the individual jobholders? How many 
individuals should we recruit? What factors (per- 
sonal, social, and technical) should we be con- 
cerned with in the selection of these individuals? 
How should they be trained, and what criteria 
should we use to measure how well they have per- 
formed their jobs? 


Before any of these questions can be answered,
administrators must define the jobs in question and
then identify the employee skills and behaviors 
necessary to perform each job. This process, known 
as job analysis, properly falls within the province 
of the I/O psychologist. We will review here some 
of the assessment issues raised by job analysis and 
survey a few standardized questionnaires used for 
this purpose (Brannick & Levine, 2002). 

Broadly speaking, job analysis consists of 
defining a job in terms of the behaviors necessary 
to perform it. Job analysis includes two major
components: job description and job specification. The
job description identifies the physical and environ- 
mental characteristics of the work to be done, 
whereas the job specification details the personal 
characteristics necessary to do that work. For 
example, a job description for office secretary 
might note that he or she must occasionally handle 
phone complaints, whereas the corresponding job 
specification might list tolerance with difficult 
people as essential. Another example: A job de- 
scription for police officer might specify that he 
or she must be able to “Carry a person who has 
been arrested and is unable or refuses to walk by 
grasping the person under the arms, supporting his 
weight, and transporting him to the police car,” 
whereas the corresponding job specification might 
refer to a minimum level of physical strength and 
stamina. We are getting ahead of the story here,
but it would even be possible to operationalize 
“strength and stamina” as that which is needed to 
drag a 120-pound dummy a distance of 50 feet 
within one minute (Arvey, Landon, Nutting, & 
Maxwell, 1992). 

The primary justification for job analysis is 
that it helps provide a valid basis for making per- 
sonnel decisions. Not only is this desirable from an 
ethical standpoint, it also meets legal standards 
mandated by the courts and regulatory bodies. In 
Albemarle v. Moody (1975) the U.S. Supreme 
Court ruled that job analysis must be included in 
validation studies that purport to demonstrate a 
relationship between a selection device and job 
performance. Also, the Uniform Guidelines on Em- 
ployee Selection (1978) require job analysis as a 
component of validation studies. As outlined by 
Cascio (1987), additional justifications for job 
analysis include organizational resource planning, 
training and personnel development, safety and im- 
provement of job methods, and personnel research. 

Numerous methods can be used to perform a 
job analysis, and a full discussion would take us too 
far afield. Briefly, Cascio (1987) lists the following
approaches: 


• Direct observation of job incumbents
• Structured interview of workers
• Collection of critical incidents from supervisors
• Checklists of duties and skills
• Questionnaires


Another useful source of information for job 
analysis is the Dictionary of Occupational Titles, 
or DOT (U.S. Department of Labor, 1991, 1993). 
The DOT provides standardized occupational
information to support job placement activities. For
example, the DOT listing for psychometrist consists 
of the following: 


045.067-018 PSYCHOMETRIST (profess. & kin.) 


Administers, scores, and interprets intelligence, 
aptitude, achievement and other psychological 
tests to provide test information to teachers, coun- 
selors, students, or other specified entitled party: 
Gives paper and pencil tests or utilizes testing 
equipment, such as picture tests and dexterity 
boards, under standard conditions. Times tests and
records results. Interprets test results in light of
standard norms, and limitations of test in terms
of validity and reliability. (U.S. Department of
Labor, 1991)


The DOT also provides occupational information 
pertinent to educational preparation, physical de- 
mands of the job, and environmental conditions typ- 
ically encountered. For example, the position of 
psychometrist requires two to four years of higher 
education and is defined as a sedentary occupation 
with frequent need for reaching, handling, manipu- 
lating, and talking. Good hearing and good close-up 
vision are also listed as necessary physical charac- 
teristics of the position (U.S. Department of Labor, 
1993). The DOT is an important source of informa- 
tion for job analysis, particularly for traditional po- 
sitions that are well defined. Recently, the U.S. 
Department of Labor began an ambitious initiative 
to analyze virtually all jobs in the economy accord- 
ing to the content or work activities required (Pe- 
terson, Mumford, Borman, and others, 1995). The 
resulting database is called O*Net, and it promises 
to be invaluable in matching people with jobs. 

But not all jobs are clearly defined, which 
means that the I/O psychologist may need to con- 
duct a formal job analysis. This is more difficult 
than it might appear. Morgeson and Campion 
(1997) have outlined the pitfalls of accurate job 
analysis, which include social factors such as con- 
formity in job analysis committees and problems of 
information overload that stem from the complex- 
ity of the task. We focus here upon questionnaires 
for job analysis, because they raise a number of in- 
teresting issues in assessment. 

Structured, quantifiable questionnaires for job 
analysis did not become popular until the Position 
Analysis Questionnaire (PAQ) appeared in the 
early 1970s (McCormick, Mecham, & Jeanneret, 
1972). This instrument introduced a novel “worker- 
oriented” concept of job analysis that allowed I/O 
psychologists to make meaningful comparisons be- 
tween different jobs. Still popular today, the PAQ 
consists of 194 items or job elements from five 
categories (McCormick, Jeanneret, & Mecham, 
1989): 


• Information input: How and where does the
  worker get the information needed for the job?
• Mental processes: What kinds of reasoning,
  planning, and decision making are required by
  the job?
• Work output: What are the physical activities
  performed and the tools or devices used by the
  worker?
• Personal relationships: What kinds of relationships
  with others are inherent to the job?
• Job context: What are the physical and social
  contexts in which the work is performed?


Some of the individual job elements are simply 
checked “yes” or “no,” whereas other items are 
rated on an appropriate scale such as importance, 
time, or difficulty. For example, in one subsection, 
job analysts rate (on a scale of 1 to 5) five aspects 
of oral communication (advising, negotiating, per- 
suading, instructing, and interviewing) as to im- 
portance for a specific job. 

The PAQ has respectable reliability when used 
by trained job analysts. Interrater reliability is 
typically around .80 for the overall instrument 
(McCormick et al., 1972; Mecham, McCormick, & 
Jeanneret, 1977). Validity is less well established, 
although Cascio (1987) has noted that PAQ results 
are not affected by the sex of the analyst, the inter- 
est level of the job incumbents, or the amount of in- 
formation provided about the job. The instrument 
does have significant limitations, including the high 
reading level needed (college graduate for some 
items) and limited suitability for professional and 
managerial positions. Perhaps a more serious limi- 
tation is that the emphasis upon behavioral aspects 
of work may overlook important task differences 
between highly dissimilar jobs. For example, Arvey 
and Begalla (1975) have noted that a police officer’s 
profile is quite similar to a housewife’s profile. Both
positions may involve conflict resolution, for example,
but in highly different contexts.
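
Interrater reliability of the kind reported above is commonly estimated as the correlation between two analysts' ratings of the same job elements. The following sketch computes a Pearson correlation for two hypothetical analysts who rate the five aspects of oral communication on a 1-to-5 importance scale. The ratings are invented for illustration and are not PAQ data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with 2+ ratings")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("ratings must vary for a correlation to exist")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical importance ratings (1 = low, 5 = high) for advising,
# negotiating, persuading, instructing, and interviewing.
analyst_a = [5, 3, 4, 2, 1]
analyst_b = [4, 3, 5, 2, 1]

print(round(pearson_r(analyst_a, analyst_b), 2))  # 0.9
```

In practice, reliability for a full instrument would be based on many more elements and often more than two raters. The two-rater correlation is simply the most transparent case.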

In response to the perceived shortcomings of the 
PAQ, Cornelius and Hakel (1978) developed the Job 
Element Inventory (JEI). Similar in format to the 
PAQ, the JEI consists of 153 items with tenth-grade 
reading level. The most significant innovation is 
that the instrument is simple enough to be completed
by job incumbents. Since professionally
trained job analysts are not needed, the JEI is highly
cost-efficient. In a factor-analytic study of the 
JEI with 2,000 Coast Guard incumbents, Harvey, 
Friedman, Hakel, and Cornelius (1988) discovered 
that the JEI factors closely paralleled those of the 
PAQ. The JEI appears to accomplish the same goals 
as the PAQ, but with much greater speed and 
efficiency. 

Several standardized job analysis questionnaires
have appeared in recent years, as outlined in
Table 11.2. These instruments are really still in
their infancy. However, with external regulations
and the threat of discrimination lawsuits always
lurking in the background, we can forecast confidently
that job analysis questionnaires will become
commonplace, at least in large corporations that
must assign diverse employees to specific jobs.


TABLE 11.2 Job Analysis Questionnaires


Professional and Managerial Position Questionnaire
(PMPQ) (Mitchell & McCormick, 1979)
A managerial-oriented instrument that assesses
interpersonal activities, planning and decision making,
processing information, technical activities, and
prerequisite background factors such as relevant
experience, special training, and second-language usage.

Occupation Analysis Inventory (OAI)
(Cunningham, Boese, Neeb, & Pass, 1983)
A general-purpose 617-item instrument that assesses
biological/health-related activities, botanical activities,
electrical-electronic repair, working with animals, and
food preparation/processing.

Fleishman Job Analysis Survey (F-JAS)
(Fleishman & Reilly, 1992)
A set of 52 rating scales relevant to work performance
in a wide range of vocations; each ability scale is a
7-point rating scale with clearly defined anchoring
points in the low, middle, and high ranges.

Common Metric Questionnaire (CMQ)
(Harvey, 1990)
The CMQ was designed to be applicable to all jobs;
includes behaviorally referenced scales so that ratings
mean the same thing across different jobs, e.g.,
interactions with others might be rated on:
coordinate/schedule their activities, sell to them or
persuade them, train/instruct/educate them.





THE ROLE OF TESTING IN PERSONNEL SELECTION


Complexities of Personnel Selection





Based upon the assumption that psychological tests 
and assessments can provide valuable information 
about potential job performance, many businesses, 
corporations, and military settings have used test 
scores and assessment results for personnel selec- 
tion. As Guion (1998) has noted, I/O research on 
personnel selection has emphasized criterion-re- 
lated validity as opposed to content or construct va- 
lidity. These other approaches to validity are 
certainly relevant, but usually take a back seat to 
criterion-related validity, which preaches that cur- 
rent assessment results must predict the future cri- 
terion of job performance. 

From the standpoint of criterion-related valid- 
ity, the logic of personnel selection is seductively 
simple. Whether in a large corporation or a small 
business, those who select employees should use 
tests or assessments that have documented, strong 
correlations with the criterion of job performance, 
and then hire the individuals who obtain the high- 
est test scores or show the strongest assessment re- 
sults. What could be simpler than that? 

Unfortunately, the real-world application of 
employment selection procedures is fraught with 
psychometric complexities and legal pitfalls. The 
psychometric intricacies arise, in large measure, 
from the fact that job behavior is rarely simple, uni- 
dimensional behavior. There are some exceptions 
(such as assembly line production) but the general 
rule in our postindustrial society is that job behav- 
ior is complex, multidimensional behavior. Even 
jobs that seem simple may be highly complex. For 
example, consider what is required for effective 
performance in the delivery of the U.S. mail. The 
individual who delivers your mail six days a week 
must do more than merely place it in your mailbox. 
He or she must accurately sort mail on the run, in- 
terpret and enforce government regulations about 
package size, manage pesky and even dangerous
animals, recognize and avoid physical dangers, and 
exercise effective interpersonal skills in dealing 
with the public, to cite just a few of the complexi- 
ties of this position. 

Personnel selection is therefore a fuzzy, con- 
ditional, and uncertain task. Guion (1991) has 
highlighted the difficulty in predicting complex be- 
havior from simple tests. For one thing, complex 
behavior is, in part, a function of the situation. This 
means that even an optimal selection approach may 
not be valid for all candidates. Quite clearly, per- 
sonnel selection is not a simple matter of adminis- 
tering tests and consulting cutoff scores. 

We must also acknowledge the profound impact 
of legal and regulatory edicts upon I/O testing prac- 
tices. Given that such practices may have weighty 
consequences—determining who is hired or pro- 
moted, for example—it is not surprising to learn that 
I/O testing practices are rigorously constrained by 
legal precedents and regulatory mandates. We re- 
view these issues in the final section of this chapter. 


Approaches to Personnel Selection 


Acknowledging that the interview is a widely used 
form of personnel assessment, it is safe to conclude 
that psychological assessment is almost a universal 
practice in hiring decisions. Even by a narrow def- 
inition that includes only paper-and-pencil mea- 
sures, at least two-thirds of the companies in the 
United States engage in personnel testing (Schmitt 
& Robertson, 1990). For purposes of personnel se- 
lection, the I/O psychologist may recommend one 
or more of the following: 


• Autobiographical data
• Employment interview
• Cognitive ability tests
• Personality, temperament, and motivation tests
• Paper-and-pencil integrity tests
• Sensory, physical, and dexterity tests
• Work sample and situational tests


We turn now to a brief survey of typical tests and
assessment approaches within each of these categories.
We close this topic with a discussion of legal
issues in personnel testing.


AUTOBIOGRAPHICAL DATA


According to Owens (1976), application forms that 
request personal and work history as well as de- 
mographic data such as age and marital status have 
been used in industry since at least 1894. Objective 
or scorable autobiographical data—sometimes 
called biodata—are typically secured by means of 
a structured form variously referred to as a bio- 
graphical information blank, biographical data 
form, application blank, interview guide, individ- 
ual background survey, or similar device. Although 
the lay public may not recognize these devices as 
true tests with predictive power, I/O psychologists 
have known for some time that biodata furnish an 
exceptionally powerful basis for the prediction of 
employee performance (Drakeley, Herriot, & 
Jones, 1988; Cascio, 1976; Ghiselli, 1966; Hunter 
& Hunter, 1984; Reilly & Chao, 1982). An impor- 
tant milestone in the biodata approach is the recent 
publication of the Biodata Handbook, a thorough 
survey of the use of biographical information in se- 
lection and the prediction of performance (Stokes, 
Mumford, & Owens, 1994). 

The rationale for the biodata approach is that fu- 
ture work-related behavior can be predicted from 
past choices and accomplishments. Biodata have 
predictive power because certain character traits 
which are essential for success also are stable and 
enduring. The consistently ambitious youth with ac- 
colades and accomplishments in high school is 
likely to continue this pattern into adulthood. Thus, 
the job applicant who served as editor of the high 
school newspaper—and who answers a biodata item 
to this effect—is probably a better candidate for cor- 
porate management than the applicant who reports 
no extracurricular activities on a biodata form. 


The Nature of Biodata 


Biodata items usually call for “factual” data; how- 
ever, items that tap attitudes, feelings, and value 
judgments are sometimes included. Except for 
demographic data such as age and marital status,
biodata items always refer to past accomplishments 
and events. Some examples of biodata items are 
listed in Table 11.3. 

Once biodata are collected, the I/O psycholo- 
gist must devise a means for predicting job perfor- 
mance from this information. The most common 
strategy is a form of empirical keying not unlike 
that used in personality testing. From a large sam- 
ple of workers who are already hired, the I/O psy- 
chologist designates a successful group and an 
unsuccessful group, based on performance, tenure, 
salary, or supervisor ratings. Individual biodata 
items are then contrasted for these two groups to 
determine which items most accurately discrimi- 
nate between successful and unsuccessful workers. 
Items that are strongly discriminative are assigned 
large weights in the scoring scheme. New appli- 
cants who respond to items in the keyed direction 
therefore receive high scores on the biodata instru- 
ment and are predicted to succeed. Cross validation 
of the scoring scheme on a second sample of suc- 
cessful and unsuccessful workers is a crucial step 
in guaranteeing the validity of the biodata selection 
method. Readers who wish to pursue the details of
empirical scoring methods for biodata instruments
should consult Murphy and Davidshofer (1988),
Mount, Witt, and Barrick (2002), and Stokes and
Cooper (2001).


TABLE 11.3 Examples of Biodata Questions


How long have you lived at your present address?

What is your highest educational degree?

How old were you when you obtained your first
paying job?

How many books (not work-related) did you read last
month?

At what age did you get your driver’s license?

In high school, did you hold a class office?

How punctual are you in arriving at work?

What job do you think you will hold in ten years?

How many hours do you watch television in a typical
week?

Have you ever been fired from a job?

How many hours a week do you spend on hobbies?

How many job projects did you manage in the last year?

In college, did you participate in a sports team?

How many hours per month do you volunteer?

What is your attitude toward others who use marijuana?
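
The empirical keying strategy described in this section can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: every biodata item is scored 0 or 1, each item weight is the difference in endorsement rates between the successful and unsuccessful groups, and all data are invented. The function names are hypothetical, and operational scoring schemes (see the sources cited above) are typically more elaborate.

```python
def endorsement_rates(group):
    """Proportion of a group answering each item in the scored direction."""
    n_items = len(group[0])
    return [sum(person[i] for person in group) / len(group)
            for i in range(n_items)]


def derive_key(successful, unsuccessful):
    """Weight each item by how strongly it separates the two groups."""
    high = endorsement_rates(successful)
    low = endorsement_rates(unsuccessful)
    return [h - l for h, l in zip(high, low)]


def biodata_score(responses, key):
    """Weighted sum of one applicant's 0/1 item responses."""
    return sum(weight * r for weight, r in zip(key, responses))


def mean_score(group, key):
    return sum(biodata_score(person, key) for person in group) / len(group)


# Hypothetical derivation sample: three items, answers coded 0 or 1.
derivation_successful = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
derivation_unsuccessful = [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
key = derive_key(derivation_successful, derivation_unsuccessful)

# Cross validation: apply the key, unchanged, to a second sample.
holdout_successful = [[1, 0, 0], [1, 1, 0]]
holdout_unsuccessful = [[0, 1, 1], [0, 0, 1]]
gap = mean_score(holdout_successful, key) - mean_score(holdout_unsuccessful, key)

print([round(w, 2) for w in key])
print(round(gap, 2))
```

An item endorsed equally often by both groups receives a weight of zero and so contributes nothing to the score. If the gap between the holdout groups shrinks toward zero, the key has probably capitalized on chance features of the derivation sample.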


The Validity of Biodata 


The validity of biodata has been surveyed by sev- 
eral reviewers, with generally positive findings 
(Mumford & Owens, 1987; Reilly & Chao, 1982; 
Rothstein, Schmidt, Erwin, Owens, & Sparks, 
1990; Russell, Mattson, Devlin, & Atwater, 1990; 
Stokes et al., 1994). An early study by Cascio 
(1976) is typical of the findings. He used a very 
simple biodata instrument—a weighted combina- 
tion of 10 application blank items—to predict 
turnover for female clerical personnel in a medium- 
sized insurance company. The cross-validated cor- 
relations between biodata score and length of 
tenure were .58 for minorities and .56 for nonminorities.¹
Drakeley et al. (1988) compared biodata
and cognitive ability tests as predictors of training 
success. Biodata scores possessed the same predic- 
tive validity as the cognitive tests. Furthermore, 
when added to the regression equation, the biodata 
information improved the predictive accuracy of 
the cognitive tests. 

In an extensive research survey, Reilly and 
Chao (1982) compared eight selection procedures 
as to validity and adverse impact on minorities. The 
procedures were biodata, peer evaluation, inter- 
views, self-assessments, reference checks, acade- 
mic achievement, expert judgment, and projective 
techniques. Noting that properly standardized abil- 
ity tests provide the fairest and most valid selection 
procedure, Reilly and Chao (1982) concluded that 
only biodata and peer evaluations had validities 
substantially equal to those of standardized tests. 
For example, in the prediction of sales productivity,
the average validity coefficient of biodata was
a very healthy .62.


1. The curious reader may wish to know which 10 biodata items
could possess such predictive power. The items were age, marital
status, children’s age, education, tenure on previous job,
previous salary, friend or relative in company, location of residence,
home ownership, and length of time at present address.
Unfortunately, Cascio (1976) does not reveal the relative weights or
direction of scoring for the items.

Certain cautions need to be mentioned with re- 
spect to biodata approaches in personnel selection. 
Employers may be prohibited by law from asking 
questions about age, race, sex, religion, and other
personal issues—even when such biodata can be 
shown empirically to predict job performance. 
Also, even though the incidence of faking is very 
low, there is no doubt that shrewd respondents can 
falsify results in a favorable direction. For exam- 
ple, Schmitt and Kunce (2002) addressed the con- 
cern that some examinees might distort their 
answers to biodata items in a socially desirable di- 
rection. These researchers compared the scores ob- 
tained when examinees were asked to elaborate 
their biodata responses versus when they were not. 
Requiring elaborated answers reduced the scores 
on biodata items; that is, it appears that respondents 
were more truthful when asked to provide corrob- 
orating details to their written responses. As with 
any measurement instrument, biodata items will 
need periodic restandardization. Finally, a potential 
drawback to the biodata approach is that, by its 
nature, this method captures the organizational sta- 
tus quo and might therefore squelch innovation. 
Becker and Colquitt (1992) discuss precautions in 
the development of biodata forms. 

There is little doubt, then, that purely objective 
biodata information can predict aspects of job per- 
formance with fair accuracy. However, employers 
are perhaps more likely to rely upon subjective in- 
formation such as interview impressions when 
making decisions about hiring. We turn now to re- 
search on the validity of the employment interview 
in the selection process. 


THE EMPLOYMENT INTERVIEW


The employment interview is usually only one part 
of the evaluation process, but most administrators 
regard it as the crucial make-or-break component of 
hiring (Miner & Miner, 1978). Landy (1985) re- 
ports that companies typically interview from five 
to twenty persons for each person hired! Considering
the importance of the interview and its tremendous
costs to industry and the professions, the
reader should not be surprised to learn that thousands
of studies address the reliability and validity
of the interview. We can only highlight a few trends
here; more detailed reviews can be found in Con- 
way, Jako, and Goodman (1995), Guion (1998), 
and Schmitt and Robertson (1990). 

Early studies of interview reliability were quite 
sobering. In various studies and reviews, reliability 
was typically assessed by correlating evaluations of 
different interviewers who had access to the same 
job candidates (Wagner, 1949; Mayfield, 1964; Ul- 
rich & Trumbo, 1965). The interrater reliability from 
dozens of these early studies was typically in the 
mid-.50s, much too low to provide accurate assess- 
ments of job candidates. This research also revealed 
that interviewers were prone to halo bias and other 
distorting influences upon their perceptions of can- 
didates. Halo bias—discussed in the next topic—is 
the tendency to rate a candidate high or low on all 
dimensions because of a global impression. 

Later, researchers discovered that interview re- 
liability could be increased substantially if the in- 
terview was jointly conducted by a panel instead of 
a single interviewer (Landy, 1996). In addition, 
structured interviews in which each candidate was 
asked the same questions by each interviewer also 
proved to be much more reliable than unstructured 
interviews (Borman, Hanson, & Hedge, 1997; 
Campion, Pursell, & Brown, 1988). In these stud- 
ies, reliabilities in the .70s and higher were found. 

Research on validity of the interview has fol- 
lowed the same evolutionary course noted for reli- 
ability: Early research that examined unstructured 
interviews was quite pessimistic, while later re- 
search using structured approaches produced more 
promising findings. In these studies, interview va- 
lidity was typically assessed by correlating inter- 
view judgments with some measure of on-the-job 
performance. Early studies of interview validity 
yielded almost uniformly dismal results, with typ- 
ical validity coefficients hovering in the mid-.20s 
(Arvey & Campion, 1982). 

Mindful that interviews are seldom used in iso- 
lation, early researchers also investigated incremen- 
tal validity, which is the potential increase in validity 
when the interview is used in conjunction with other
information. These studies were predicated on the 
optimistic assumption that the interview would con- 
tribute positively to candidate evaluation when used 
alongside objective test scores and background data. 
Unfortunately, the initial findings were almost en- 
tirely unsupportive (Landy, 1996). 
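
Incremental validity can be made concrete with the standard two-predictor formula for the multiple correlation. The sketch below assumes that the two predictors are combined with optimal (least-squares) weights, which is not how interviewers usually combine information. The function names are hypothetical and the correlations are invented values, not findings from the studies cited.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2 -- validity of predictor 1 and of predictor 2
    r_12       -- correlation between the two predictors
    Assumes the three values form a consistent correlation matrix.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


def incremental_validity(r_y1, r_y2, r_12):
    """Gain in validity from adding predictor 2 to predictor 1."""
    return multiple_r(r_y1, r_y2, r_12) - abs(r_y1)


# Hypothetical values: test validity .40, interview validity .25,
# and a test-interview correlation of .50.
print(round(multiple_r(0.40, 0.25, 0.50), 3))
print(round(incremental_validity(0.40, 0.25, 0.50), 3))
```

With these invented figures the interview adds almost nothing, because most of what it measures overlaps with the test. Under least-squares weighting the increment cannot be negative, which suggests that any loss of accuracy after adding interview information arises from how judges combine the information rather than from the information itself.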

In some instances, attempts to prove incre- 
mental validity of the interview demonstrated just 
the opposite, what might be called decremental 
validity. For example, Kelly and Fiske (1951) es- 
tablished that interview information actually de- 
creased the validity of graduate student evaluations. 
In this early and classic study, the task was to pre- 
dict the academic performance of more than 500 
graduate students in psychology. Various combina- 
tions of credentials (a form of biodata), objective 
test scores, and interview were used as the basis for 
clinical predictions of academic performance. The 
validity coefficients are reported in Table 11.4. The 
reader will notice that credentials alone provided a 
much better basis for prediction than credentials 
plus a one-hour interview. The best predictions 
were based upon credentials and objective test 
scores; adding a two-hour interview to this infor- 
mation actually decreased the accuracy of predic- 
tions. These findings highlighted the superiority of 
actuarial prediction (based on empirically derived 
formulas) over clinical prediction (based on subjective
impressions). We pursue the actuarial versus
clinical debate in the last chapter of this text.


TABLE 11.4 Validity Coefficients for Ratings
Based on Various Combinations of Information

                                          Correlation with
Basis for Rating                          Academic Performance

Credentials alone                                 .26
Credentials and one-hour interview                .13
Credentials and objective test scores             .36
Credentials, test scores, and
  two-hour interview                              .32

Source: Based on data in Kelly, E. L., & Fiske, D. W. (1951).
The prediction of performance in clinical psychology. Ann
Arbor: University of Michigan Press.

Recent studies using carefully structured inter- 
views, including situational interviews, provide a 
more positive picture of interview validity (Borman,
Hanson, & Hedge, 1997; Maurer & Fay,
1988; Schmitt & Robertson, 1990). When the find- 
ings are corrected for restriction of range and unre- 
liability of job performance ratings, the mean 
validity coefficient for structured interviews turns 
out to be an impressive .63 (Wiesner & Cronshaw, 
1988). In a recent meta-analysis, Conway, Jako, 
and Goodman (1995) concluded that the upper
limit for the validity coefficient of structured inter- 
views was .67, whereas for unstructured interviews 
the validity coefficient was only .34. Additional 
reasons for preferring structured interviews include 
their legal defensibility in the event of litigation 
(Williamson, Campion, Malo, and others, 1997)
and, surprisingly, their minimal bias across differ- 
ent racial groups of applicants (Huffcutt & Roth, 
1998). 
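
Corrections of the kind mentioned above are usually computed with two textbook formulas: the correction for attenuation due to an unreliable criterion, and Thorndike's Case II correction for direct restriction of range on the predictor. The sketch below applies them in one possible order. It is an assumption of this example, not a statement from the cited meta-analyses, that these exact formulas and this ordering were used, and all input values are hypothetical.

```python
import math


def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Correction for attenuation: divide by the square root of the
    reliability of the job performance ratings."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("reliability must be in the interval (0, 1]")
    return r_observed / math.sqrt(criterion_reliability)


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    sd_ratio -- predictor SD among all applicants divided by the
                predictor SD among those actually hired (>= 1).
    """
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1 + r_restricted ** 2 * (sd_ratio ** 2 - 1))
    return numerator / denominator


# Hypothetical study: observed validity .30 among hired employees,
# applicant SD 1.5 times the SD of those hired, rating reliability .60.
r_unrestricted = correct_for_range_restriction(0.30, 1.5)
r_corrected = correct_for_criterion_unreliability(r_unrestricted, 0.60)
print(round(r_unrestricted, 2), round(r_corrected, 2))
```

Both corrections raise the coefficient, so a corrected validity is best read as an estimate of what the correlation would be under ideal measurement conditions, not as the correlation actually observed in any one sample.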

In order to reach acceptable levels of reliability 
and validity, structured interviews must be de- 
signed with painstaking care. Consider the protocol 
used by Motowidlo et al. (1992) in their research on 
structured interviews for management and market- 
ing positions in eight telecommunications compa- 
nies. Their interview format was based upon a 
careful analysis of critical incidents in marketing 
and management. Prospective employees were 
asked a set of standard questions about how they 
had handled past situations similar to these critical 
incidents. Interviewers were trained to ask discre- 
tionary probing questions for details about how the 
applicants handled these situations. Throughout, 
the interviewers took copious notes. Applicants 
were then rated on scales anchored with behavioral 
illustrations. Finally, these ratings were combined 
to yield a total interview score used in selection 
decisions. 

In summary, under carefully designed condi- 
tions, the interview can provide a reliable and valid 
basis for personnel selection. However, as noted by 
Schmitt and Robertson (1990), the prerequisite 
conditions for interview validity are not always 
available. Guion (1998) has expressed the same
point: 


A large body of research on interviewing has, in my 
opinion, given too little practical information about 
how to structure an interview, how to conduct it, 
and how to use it as an assessment device. I think I 
know from the research that (a) interviews can be 
valid, (b) for validity they require structuring and 
standardization, (c) that structure, like many other 
things, can be carried too far, (d) that without care- 
fully planned structure (and maybe even with it) in- 
terviewers talk too much, and (e) that the interviews 
made routinely in nearly every organization could 
be vastly improved if interviewers were aware of 
and used these conclusions. There is more to be 
learned and applied. (p. 624) 


The essential problem is that each interviewer may 
evaluate only a small number of applicants, so that 
standardization of interviewer ratings is not always 
realistic. While the interview is potentially valid as 
a selection technique, in its common, unstructured 
application there is probably substantial reason for 
concern. 

Why are interviews used? If the typical, un- 
structured interview is so unreliable and ineffec- 
tual a basis for job candidate evaluation, why do 
administrators continue to value interviews so
highly? In their review of the employment inter- 
view, Arvey and Campion (1982) outline several 
reasons for the persistence of the interview, in- 
cluding practical considerations such as the need to 
sell the candidate on the job, and social reasons such 
as the susceptibility of interviewers to the illusion of 
personal validity. Others have emphasized the im- 
portance of the interview for assessing a good fit be- 
tween applicant and organization (Adams, Elacqua, 
& Colarelli, 1994; Latham & Skarlicki, 1995). 

It is difficult to imagine that most employers 
would ever eliminate entirely the interview from 
the screening and selection process. After all, the 
interview does serve the simple human need of 
meeting the persons who might be hired. However, 
based on 50 years' worth of research, it is evident
that biodata and objective tests often provide a 
more powerful basis for candidate evaluation and 
selection than unstructured interviews. 


COGNITIVE ABILITY TESTS


Cognitive ability can refer either to a general 
construct akin to intelligence or to a variety of 
specific constructs such as verbal skills, numerical 
ability, spatial perception, or perceptual speed 
(Kline, 1993). Tests of general cognitive ability 
and measures of specific cognitive skills have many 
applications in personnel selection, evaluation, and 
screening. Such tests are quick, inexpensive, and 
easy to interpret. A vast body of empirical re- 
search offers modest to strong support for the 
validity and fairness of standardized ability tests 
in personnel selection (Gottfredson, 1986). Cer- 
tainly, it seems clear that ability tests often provide 
an excellent basis for job selection, at least accord- 
ing to objective criteria such as the capacity to pre- 
dict job performance. For example, Hunter and 
Hunter (1984) conducted a meta-analysis of re- 
search on the prediction of job performance and 
concluded that for entry-level jobs no predictor (ex- 
cept for the work sample) exceeded the validity of 
ability tests, which showed a mean validity coeffi- 
cient of .54. 

Even so, a significant concern with the use of 
cognitive ability tests for personnel selection is that 
these instruments may result in an adverse impact 
on minority groups. Adverse impact is a legal term 
(discussed later in this chapter) referring to the dis- 
proportionate selection of white candidates over 
minority candidates. Most authorities in personnel 
psychology recognize that cognitive tests play an 
essential role in applicant selection; nonetheless, 
these experts also affirm that cognitive tests provide 
maximum benefit (and minimum adverse impact) 
when combined with other approaches such as bio- 
data. Selection decisions never should be made ex- 
clusively on the basis of cognitive test results 
(Robertson & Smith, 2001; Outtz, 2002). 
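
As a rough illustration of how disproportionate selection is commonly quantified, the sketch below compares selection rates for two applicant groups using the four-fifths (80 percent) rule of thumb from the Uniform Guidelines on Employee Selection Procedures. The rule is not named in the passage above and is introduced here only as an assumption for the example; the applicant counts are hypothetical.

```python
def selection_rate(hired, applicants):
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return hired / applicants


def impact_ratio(focal_rate, reference_rate):
    """Selection rate of the focal group relative to the reference group."""
    return focal_rate / reference_rate


FOUR_FIFTHS = 0.80  # conventional rule-of-thumb threshold

# Hypothetical applicant counts, for illustration only.
reference_rate = selection_rate(hired=60, applicants=100)
focal_rate = selection_rate(hired=20, applicants=50)

ratio = impact_ratio(focal_rate, reference_rate)
flagged = ratio < FOUR_FIFTHS
print(f"impact ratio = {ratio:.2f}, below four-fifths threshold: {flagged}")
```

A ratio below the threshold is usually treated as a signal for closer scrutiny rather than as proof of unfairness, which is one reason combining cognitive tests with other predictors is recommended.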

An ongoing debate within I/O psychology is 
whether employment testing is best accomplished 
with highly specific ability tests or with measures 
of general cognitive ability. The weight of the evi- 
dence seems to support the conclusion that a gen- 
eral factor of intelligence (the so-called g factor) is 
usually a better predictor of training and job success 
than are scores on specific cognitive measures—
even when several specific cognitive measures are 
used in combination. Of course, this conclusion 
runs counter to common sense and anecdotal evi- 
dence. For example, Kline (1993) offers the fol- 
lowing vignette: 


The point is that the g factors are important but so 
also are these other factors. For example, high g is 
necessary to be a good engineer and to be a good 
journalist. However for the former high spatial 
ability is also required, a factor which confers little 
advantage on a journalist. For her or him, however, 
high verbal ability is obviously useful. 


Curiously, empirical research provides only mixed 
support for this position (Gottfredson, 1986; Larson 
& Wolfe, 1995; Ree, Earles, & Teachout, 1994). Al- 
though the topic continues to be debated, most stud- 
ies support the primacy of g in personnel selection 
(Borman et al., 1997). Perhaps the reason that g usu- 
ally works better than specific cognitive factors in 
predicting job performance is that most jobs are fac- 
torially complex in their requirements, stereotypes 
notwithstanding (Guion, 1998). For example, the 
successful engineer must explain his or her ideas to 
others and so needs verbal ability as well as spatial 
skills. Since measures of general cognitive ability
tap many specific cognitive skills, a general test
often predicts performance in complex jobs as well
as, or better than, measures of specific skills.

Literally hundreds of cognitive ability tests are
available for personnel selection, so it is not feasi- 
ble to survey the entire range of instruments here. 
Instead, we will highlight three representative tests: 
one that measures general cognitive ability, a sec- 
ond that is germane to assessment of mechanical 
abilities, and a third that taps a highly specific facet 
of clerical work. The three instruments chosen for 
review—the Wonderlic Personnel Test, the
Bennett Mechanical Comprehension Test, and the 
Minnesota Clerical Test—are merely exemplars of 
the hundreds of cognitive ability tests available for 
personnel selection. All three tests are often used in 
business settings and are therefore worthy of specific
mention. Representative cognitive ability tests encountered
in personnel selection are listed in Table
11.5. Some classic viewpoints on cognitive ability
testing for personnel selection are found in Ghiselli
(1966, 1973), Hunter and Hunter (1984), and Reilly
and Chao (1982). Contemporary discussion of this
issue is provided by Borman et al. (1997), Guion
(1998), and Murphy (1996).


TABLE 11.5 Representative Cognitive Ability
Tests Used in Personnel Selection


General Ability Tests

Shipley Institute of Living Scale
Wonderlic Personnel Test
Wesman Personnel Classification Test
Personnel Tests for Industry


Multiple Aptitude Test Batteries

General Aptitude Test Battery
Armed Services Vocational Aptitude Battery
Differential Aptitude Test
Employee Aptitude Survey


Mechanical Aptitude Tests

Bennett Mechanical Comprehension Test
Minnesota Spatial Relations Test
Revised Minnesota Paper Form Board Test
SRA Mechanical Aptitudes


Motor Ability Tests

Crawford Small Parts Dexterity Test
Purdue Pegboard
Hand-Tool Dexterity Test
Stromberg Dexterity Test


Clerical Tests

Minnesota Clerical Test
Clerical Abilities Battery
General Clerical Test
SRA Clerical Aptitudes


Note: SRA denotes Science Research Associates. These tests are
reviewed in the Mental Measurements Yearbook series.


Wonderlic Personnel Test 


Even though it is described as a personnel test, the 
Wonderlic Personnel Test (WPT) is really a group 
test of general mental ability (Hunter, 1989; Wonderlic,
1983). What makes this instrument some-
what of an institution in personnel testing is its for- 
mat (50 multiple-choice items), its brevity (a 
12-minute time limit), and its numerous parallel 
forms (16 at last count). Item types on the Wonder- 
lic are quite varied and include vocabulary, sentence 
rearrangement, arithmetic problem solving, logical 
induction, and interpretation of proverbs. The fol- 
lowing items capture the flavor of the Wonderlic: 


1. REGRESS is the opposite of 
a. ingest b. advance 
c. close d. open 
2. Two men buy a car which costs $550; X pays 
$50 more than Y. How much did X pay? 
a. $500 b. $300 c. $400 d. $275 
3. HEFT CLEFT—Do these words have 
a. similar meaning b. opposite meaning 
c. neither similar nor opposite meaning 


The reliability of the WPT is quite impressive, 
especially considering the brevity of the instrument. 
Internal consistency reliabilities typically reach .90, 
while alternative-form reliabilities usually exceed 
.90. Normative data are available on 126,000 adults 
ages 20 to 65 years. Regarding validity, if the WPT 
is considered a brief test of general mental ability, 
the findings are quite positive (Dodrill & Warner, 
1988). For example, Dodrill (1981) reports a corre- 
lation of .91 between scores on the WPT and scores 
on the WAIS. This correlation is as high as that 
found between any two mainstream tests of general 
intelligence. Bell, Matthews, Lassiter, and Leverett
(2002) reported strong congruence between the 
WPT and the Kaufman Adolescent and Adult In- 
telligence Test in a sample of adults. Hawkins, 
Faraone, Pepple, Seidman, and Tsuang (1990) re- 
port a similar correlation (r = .92) between WPT 
and WAIS-R IQ for 18 chronically ill psychiatric 
patients. However, in their study, one subject was 
unable to manage the format of the WPT, suggest- 
ing that severe visuospatial impairment can invali- 
date the test. A recent innovation with the 
Wonderlic is the addition of four forms of the test 
(called the Scholastic Level Exam) for use in educational
selection and counseling. The validity of
the Wonderlic in educational settings is not yet
firmly established (Belcher, 1992).

Reviewers of the WPT do raise some concerns 
about the interpretive guidelines found in the test 
manual (Geisinger, 2001). For example, the man- 
ual suggests that persons who earn raw scores be- 
tween 16 and 22 have a limited capacity for 
anything other than routine tasks. These WPT 
scores correspond to IQs of 93 to 104; that is, such 
persons are completely within the normal range of 
intelligence. Thus, the interpretive guidelines seem 
both arbitrary and unnecessarily restrictive. The 
manual also lists cutting scores used in industry for 
over 75 occupations, which raises the specter of an 
undertrained personnel officer overinterpreting individual
scores. This would be especially problematic
for racial minorities, since race differences on
the WPT are significant (Geisinger, 2001). In fact,
Chan (1997) reported that African American undergraduates
perceived the WPT to be a less valid
predictive employment measure than did white
students.

Another concern about the Wonderlic is that ex- 
aminees whose native language is not English will 
be unfairly penalized on the test (Belcher, 1992). 
The Wonderlic is a speeded test. In fact, it has such 
a heavy reliance on speed that points are added for 
subjects age 30 and older to compensate for the 
well-known decrement in speed that accompanies 
normal aging. However, no accommodation is made 
for nonnative English speakers who might also per- 
form more slowly. One solution to the various is- 
sues of fairness cited would be to provide norms for 
untimed performance on the Wonderlic. However, 
the publishers have resisted this suggestion. 


Bennett Mechanical Comprehension Test 


In many trades and occupations, the understanding 
of mechanical principles is a prerequisite to 
successful performance. Automotive mechanics, 
plumbers, mechanical engineers, trade school ap- 
plicants, and persons in many other “hands-on” vo- 
cations need to comprehend basic mechanical 
principles in order to succeed in their fields. In 
these cases, a useful instrument for occupational
testing is the Bennett Mechanical Comprehension 
Test (BMCT). This test consists of pictures about 
which the examinee must answer straightforward 
questions. The situations depicted emphasize basic 
mechanical principles that might be encountered in 
everyday life. For example, a series of belts and fly- 
wheels might be depicted, and the examinee would 
be asked to discern the relative revolutions per 
minute of two flywheels. The test includes two 
equivalent forms (S and T). 

The BMCT has been widely used since World 
War II for military and civilian testing, so an extensive
body of technical and validity data exists for
this instrument. Split-half reliability coefficients 
range from the .80s to the low .90s. Comprehensive 
normative data are provided for several groups. 
Based on a huge body of earlier research, the concurrent
and predictive validity of the BMCT appear
to be well established (Wing, 1992). For example, 
in one study with 175 employees, the correlation 
between the BMCT and the DAT Mechanical Rea- 
soning subtest was an impressive .80. An intriguing 
finding is that the test proved to be one of the best 
predictors of pilot success during World War II 
(Ghiselli, 1966). 

In spite of its psychometric excellence, the 
BMCT is in need of modernization. The test looks 
old and many items are dated. By contemporary 
standards, some BMCT items are sexist or poten- 
tially offensive to minorities (Wing, 1992). The 
problem with dated and offensive test items is that 
they can subtly bias test scores. Modernization of 
the BMCT would be a straightforward project that 
could increase the acceptability of the test to 
women and minorities while simultaneously pre- 
serving its psychometric excellence. 


Minnesota Clerical Test 


The Minnesota Clerical Test (MCT), which pur- 
ports to measure perceptual speed and accuracy rel- 
evant to clerical work, has remained essentially 
unchanged in format since its introduction in 1931, 
although the norms have undergone several revisions,
most recently in 1979 (Andrew, Peterson, &
Longstaff, 1979). The MCT is divided into two
subtests: Number Comparison and Name Comparison.
Each subtest consists of 100 identical and 100
dissimilar pairs of digit or letter combinations
(Table 11.6). The dissimilar pairs generally differ in
regard to only one digit or letter, so the comparison
task is challenging. The examinee is required to
check only the identical pairs, which are randomly
intermixed with dissimilar pairs. The score depends
predominantly upon speed, although the examinee
is penalized for incorrect items (errors are subtracted
from the number of correct items).


TABLE 11.6 Items Similar to Those Found
on the Minnesota Clerical Test


Number Comparison

1. 3496482 ____ 3495482
2. 17439903 ____ 17439903
3. 84023971 ____ 84023971
4. 910386294 ____ 910368294

Name Comparison

1. New York Globe ____ New York Globe
2. Brownell Seed ____ Brownel Seed
3. John G. Smith ____ John G Smith
4. Daniel Gregory ____ Daniel Gregory
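
The scoring rule described for the MCT (check only the identical pairs, then subtract errors from the number correct) can be sketched in a few lines. This is an illustration only, not the publisher's scoring procedure: the function name is hypothetical, and treating every checked dissimilar pair as one error is an assumption. The sample pairs are the Number Comparison items shown in Table 11.6.

```python
def score_subtest(pairs, checked):
    """Return number correct minus number of errors.

    pairs:   list of (left, right) strings presented to the examinee
    checked: set of indices the examinee marked as identical
    Assumption: an error is a checked pair that is actually dissimilar.
    """
    correct = sum(1 for i in checked if pairs[i][0] == pairs[i][1])
    errors = len(checked) - correct
    return correct - errors


# Number Comparison items from Table 11.6 (indices start at 0).
number_pairs = [
    ("3496482", "3495482"),      # dissimilar
    ("17439903", "17439903"),    # identical
    ("84023971", "84023971"),    # identical
    ("910386294", "910368294"),  # dissimilar
]

careful = score_subtest(number_pairs, checked={1, 2})
hasty = score_subtest(number_pairs, checked={0, 1, 2})
print(careful, hasty)
```

Because only checked pairs contribute to the score, an examinee who works quickly but carelessly gains little: each wrongly checked pair cancels one correct response.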

The reliability of the MCT is acceptable, with 
reported stability coefficients in the range of .81 to 
.87 (Andrew, Peterson, & Longstaff, 1979). The 
manual also reports a wealth of validity data, in- 
cluding some findings that are not altogether flat- 
tering. In these studies, the MCT was correlated 
with measures of job performance, measures of 
training outcome, and scores from related tests. The 
job performance of directory assistants, clerks, 
clerk-typists, and bank tellers was correlated sig- 
nificantly but not robustly with scores on the MCT. 
The MCT is also highly correlated with other tests 
of clerical ability. 

Nonetheless, questions still remain about the 
validity and applicability of the MCT. Ryan (1985) 
notes that the manual lacks a discussion of the significant
versus the nonsignificant validity studies.
In addition, the MCT authors fail to provide de- 
tailed information concerning the specific attributes 
of the jobs, tests, and courses used as criterion mea- 
sures in the reported validity studies. For this rea- 
son, it is difficult to surmise exactly what the MCT 
measures. Both Thomas (1985) and Ryan (1985) 
complain that the new 1979 norms are difficult to 
use because the MCT authors provide so little in- 
formation on how the various norm groups were 
constituted. Thus, even though the revised MCT 
manual presents new norms for 10 vocational cat- 
egories, the test user may not be sure which norm 
group applies to his or her setting. Because of the 
marked differences in performance between the 
norm groups, the vagueness of definition poses a 
significant problem to potential users of this test.


PERSONALITY AND TEMPERAMENT TESTS


Several personality tests provide a useful basis for 
employee selection, when used appropriately. Citing
bygone practices, Muchinsky (1990) warns
against the reckless use of personality tests for
personnel selection: 


Personality inventories such as the MMPI 
[Minnesota Multiphasic Personality Inventory] 
were used for many years for personnel 
selection—in fact, overused or misused. They 
were used indiscriminately to assess a candi- 
date’s personality, even when there was no 
established relation between test scores and 
job success. Soon personality inventories came 
under attack. 


Of course, it is essential that personality tests
possess a demonstrated link to job performance 
before they are used in personnel selection. Un- 
fortunately, the literature on this topic is somewhat 
sobering for most personality scales. To illustrate 
this point, we have summarized key findings of 
a review by Hough, Eaton, Dunnette, Kamp, and 
McCloy (1990) in Table 11.7. The table reports 
the mean correlations between various categories 
of personality measures and several job-related 
criteria. The data are based upon hundreds of
published studies during the time period 1960 to
1984.


TABLE 11.7 Summary of Criterion-Related Validity Studies Using Personality
Tests to Predict Job Criteria


                                        Job Criteria
Personality          Job            Job                          Substance
Construct            Involvement    Proficiency    Delinquency   Abuse

Surgency*             .04            .04            -.29          .06
Affiliation           .06           -.01            n/a          -.03
Adjustment            .13            .13            -.43         -.07
Agreeableness         .02           -.01            -.31         -.04
Dependability         .17           [illegible]     -.27         -.28
Intellectance**      -.10            .01            -.24          .18
Achievement           .24            n/a            -.35          n/a
Masculinity           .10            n/a             .02         -.18
Locus of control     [illegible]     n/a             n/a          n/a


Note: n/a signifies an absence of research on this predictor-criterion relationship.
*Personality trait characterized by sociability, cheerfulness, and social responsiveness.
**Personality trait characterized by intellectual analysis and thinking things through.

Source: Adapted with permission from Hough, L. M., Eaton, N., Dunnette, M., Kamp, J., & McCloy, R.
(1990). Criterion-related validities of personality constructs and the effect of response distortion on those
validities [Monograph]. Journal of Applied Psychology, 75, 581-595.

The reader will notice that most of the cor- 
relations are too low to be of practical value. For 
example, the mean correlation between Adjust- 
ment scores (from various personality scales) and 
Job Proficiency (usually supervisor ratings) is only 
.13. Data for this cell are derived from 146 differ- 
ent studies, so the results are highly stable and in- 
dicate that, on average, measures of psychological 
adjustment are very, very weak predictors of job 
performance. Yet, there is some cause for limited 
optimism from these studies, as indicated by the 
healthy correlation of -.43 between Adjustment and
Delinquency (e.g., neglect of work duties). Notice,
too, that measures of Dependability correlate -.28,
on average, with Substance Abuse. Apparently, 
it is easier to predict some job-related criteria than 
others. 
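
One common way to judge the practical value of a correlation is to square it, which gives the proportion of criterion variance associated with the predictor. The short sketch below applies this standard interpretation to two coefficients quoted in the paragraph above; the squared-correlation reading is a general statistical convention rather than an analysis reported by Hough et al. (1990).

```python
def variance_explained(r):
    """Proportion of criterion variance shared with the predictor (r squared)."""
    return r * r


# Coefficients quoted in the text above.
adjustment_proficiency = 0.13
adjustment_delinquency = -0.43

for label, r in [
    ("Adjustment vs. Job Proficiency", adjustment_proficiency),
    ("Adjustment vs. Delinquency", adjustment_delinquency),
]:
    print(f"{label}: r = {r:+.2f}, variance explained = {variance_explained(r):.1%}")
```

On this reading, a correlation of .13 accounts for under 2 percent of the variance in job proficiency, whereas a correlation of -.43 accounts for roughly 18 percent of the variance in delinquency.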

Certain tests are known to have greater validity 
than others for specific applications in personnel 
selection. For example, the California Psycho- 
logical Inventory (CPI) provides an accurate mea- 
sure of managerial potential (Gough, 1984, 1987). 
Certain scales of the CPI predict overall perfor- 
mance of military academy students reasonably 
well (Blake, Potter, & Sliwak, 1993). The Inwald 
Personality Inventory is well validated as a pre- 
employment screening test for law enforcement 
(Inwald, 1988). The Minnesota Multiphasic Per- 
sonality Inventory also bears mention as a selection 
tool for law enforcement (Hiatt & Hargrave, 1988). 
Finally, the Hogan Personality Inventory (HPI) is 
well validated for prediction of job performance 
in military, hospital, and corporate settings. The 
HPI was based upon the Big Five theory of per- 
sonality (see Topic 13A, Theories and the Mea- 
surement of Personality). This instrument has 
cross-validated criterion-related validities as high 
as .60 for some scales (Hogan, 1986; Hogan & 
Hogan, 1986). Additional job-related applications 
of personality tests are discussed in Topic 14A, 
Self-Report Inventories. Borman et al. (1997) pro- 
vide a good summary of recent studies on this 
topic.

Several test publishers have introduced instru-
ments designed to screen theft-prone individuals 
and other undesirable job candidates such as per- 
sons who are undependable or frequently absent 
from work (Hogan & Hogan, 1989; O’Bannon, 
Goldinger, & Appleby, 1989; Ones & Viswes- 
varan, 1998; Ones, Viswesvaran, & Schmidt, 1993; 
Sackett, Burris, & Callahan, 1989). These tests 
come in two clearly differentiated types: overt in- 
tegrity tests and personality-based measures. We 
will discuss each type separately, concentrating 
upon issues raised by these tests rather than de- 
tailing the merits or demerits of individual instru- 
ments. Table 11.8 lists 20 of the more commonly 
used instruments. 

One problem with integrity tests is that their 
proprietary nature makes it difficult to scrutinize 
them in the same manner as traditional instruments. 
In most cases, scoring keys are available only to in- 
house psychologists, which makes independent re- 
search difficult. Nonetheless, a sizable body of 
research now exists on integrity tests, as discussed 
in the following section on validity. 


Overt Integrity Tests 


Overt integrity tests typically consist of two sec- 
tions. The first is a section dealing with attitudes 
toward theft and other forms of dishonesty such as 
beliefs about extent of employee theft, degree of 
condemnation of theft, endorsement of common ra- 
tionalizations about theft, and perceived ease of 
theft. The second is a section dealing with overt ad- 
missions of theft and other illegal activities such as 
items stolen in the last year, gambling, and drug 
use. The most widely researched tests of this type 
include the Personnel Selection Inventory, the Reid 
Report, and the Stanton Survey. The interested 
reader can find addresses for the publishers of these 
and related instruments in O’Bannon et al. (1989). 

Apparently, overt integrity tests can be more 
easily faked than personality-based integrity tests
and might therefore be of less value in screening
dishonest applicants. For example, Ryan and Sackett
(1987) created a generic overt integrity test
modeled upon existing instruments. The test contained
52 attitude and 11 admission items. In
comparison to a contrast group asked to respond
truthfully and another contrast group asked to respond
as job applicants, subjects asked to “fake
good” produced substantially superior scores (i.e.,
better attitudes and fewer theft admissions).


TABLE 11.8 Commonly Used Integrity Tests


Overt Integrity Tests

Accutrac Evaluation System
Applicant Review
Compuscan
Employee Attitude Inventory
Employee Reliability Inventory
Integrity Interview
Orion Survey
PEOPLE Survey
Personnel Selection Inventory
Phase II Profile
Reid Report and Reid Survey
Rely
Stanton Survey
True Test


Personality-Based Integrity Tests

Employment Productivity Index
Hogan Personnel Selection Series
Inwald Personality Inventory
Personnel Decisions, Inc., Employment Inventory
Personnel Outlook Inventory
Personnel Reaction Blank


Note: Publishers and authors of these tests can be found in O’Bannon,
R. M., Goldinger, L. A., & Appleby, G. S. (1989). Honesty
and integrity testing. Atlanta, GA: Applied Information Resources.


Personality-Based Integrity Tests 


The personality-based integrity tests typically do 
not contain obvious references to theft or other 


forms of undesirable employee behavior. These 
measures are more subtle in their approach and 
therefore less offensive to most job candidates. In 
fact, some integrity tests are really nothing more 
than recycled parts of existing personality tests 
such as the California Psychological Inventory 
(CPI). For example, the Personnel Reaction Blank 
(Gough, 1971) is based on those portions of the CPI 
dealing with sociability, dependability, conscien- 
tiousness, internal values, self-restraint, and accep- 
tance of convention. In general, paper-and-pencil 
measures of conscientiousness show strong rela- 
tionships with work-related integrity (Collins & 
Schmidt, 1993). 

One common test development strategy for per- 
sonality-based integrity measures is empirical key- 
ing against a criterion of theft. The problem with 
this approach is the criterion: Theft is rarely appre- 
hended and admissions of theft may or may not be 
accurate. The base rate for employee theft is almost 
impossible to nail down. For example, rates of self- 
reported theft range from 28 to 62 percent in dif- 
ferent studies (Camara & Schneider, 1994). Thus, 
the criterion classification of some research sub- 
jects may not be valid. A second approach is to 
measure broad constructs such as general employee 
deviance as indicated by hostility toward authority, 
thrill seeking, irresponsibility, and social insensi- 
tivity. The instruments that employ this strategy 
show modest ability to predict global criteria such 
as supervisor ratings of effectiveness (Ones et al., 
1993; Sackett et al., 1989). 

A serious problem with most integrity tests is 
the very high fail rate, typically in the 30 to 60 per- 
cent range. Because integrity tests commonly are 
the final hurdle—used only with the small fraction 
of applicants who have the necessary ability and 
relevant experience—organizations that employ 
these tests must be in a position to turn away the 
majority of applicants. Of course, the high fail rate 
is, in part, a consequence of stringent cutting scores 
that cause rejection of potentially valuable em- 
ployees (false positives) alongside real thieves and 
scoundrels (true positives). Actually, this is a 
validity issue, as discussed in the following section.


Validity of Integrity Tests


Publishers of integrity tests have responded to skep- 
tical psychologists and a distrustful public with a 
barrage of criterion-related validity studies. Ones et 
al. (1993) requested data on integrity tests from 
publishers, authors, and colleagues. These sources 
proved highly cooperative: The authors collected 
665 validity coefficients based upon 25 integrity 
tests administered to more than half a million 
employees. Using the intricate procedures of meta- 
analysis, Ones et al. (1993) computed an average 
validity coefficient of .41 when integrity tests were 
used to predict supervisory ratings of job perfor- 
mance. Interestingly, integrity tests predicted global 
disruptive behaviors (theft, illegal activities, absen- 
teeism, tardiness, drug abuse, dismissals for theft, 
and violence on the job) better than they predicted 
employee theft alone. The authors concluded with a 
mild endorsement of these instruments: 


When we started our research on integrity tests, 
we, like many other industrial psychologists, were 
skeptical of integrity tests used in industry. Now, 
on the basis of analyses of a large database consist- 
ing of more than 600 validity coefficients, we con- 
clude that integrity tests have substantial evidence 
of generalizable validity. 


This conclusion is echoed in a series of inge- 
nious studies by Cunningham, Wong, and Barbee 
(1994). Among other supportive findings, these re- 
searchers discovered that integrity test results were 
correlated with returning an overpayment—even 
when subjects were instructed to provide a positive 
impression on the integrity test. 

Other reviewers are more cautious in their 
conclusions. In commenting on recent reviews by 
the American Psychological Association and the 
Office of Technology Assessment, Camara and 
Schneider (1994) concluded that integrity tests do 
not measure up to expectations of experts in as- 
sessment, but that they are probably better than hit- 
or-miss, unstandardized methods used by many 
employers to screen applicants. 

Several concerns remain about integrity tests. 
Publishers may release their instruments to unqualified
users, which is a violation of ethical stan-
dards of the American Psychological Association. 
A second problem arises from the unknown base 
rate of theft and other undesirable behaviors, which 
makes it difficult to identify optimal cutting scores 
on integrity tests. If cutting scores are too stringent, 
honest job candidates will be disqualified unfairly. 
Conversely, too lenient a cutting score renders the 
testing pointless. A final concern is that situational 
factors may moderate the validity of these instru- 
ments. For example, how a test is portrayed to ex- 
aminees may powerfully affect their responses and 
therefore skew the validity of the instrument. 
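
The dependence of cutting-score decisions on the unknown base rate can be illustrated with a small calculation. In the sketch below, the test's hit rates (sensitivity and specificity), the applicant pool size, and the two base rates are all hypothetical values chosen for illustration; none come from the studies cited in this section.

```python
def screen_outcomes(n, base_rate, sensitivity, specificity):
    """Return (fail_rate, share_of_failed_who_are_honest) for one cutting score.

    base_rate:   assumed proportion of applicants who would be dishonest
    sensitivity: assumed proportion of dishonest applicants who fail the test
    specificity: assumed proportion of honest applicants who pass the test
    """
    dishonest = n * base_rate
    honest = n - dishonest
    true_positives = dishonest * sensitivity
    false_positives = honest * (1.0 - specificity)
    failed = true_positives + false_positives
    return failed / n, false_positives / failed


# Same hypothetical test and cutting score under two assumed base rates.
for base_rate in (0.10, 0.30):
    fail_rate, honest_share = screen_outcomes(
        n=1000, base_rate=base_rate, sensitivity=0.80, specificity=0.60
    )
    print(
        f"base rate {base_rate:.0%}: fail rate {fail_rate:.0%}, "
        f"honest among those failed {honest_share:.0%}"
    )
```

With identical test properties, the share of rejected applicants who are actually honest changes substantially as the assumed base rate changes, which is why a cutting score cannot be called optimal without a credible estimate of that rate.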

Increasingly, the fate of employment testing is 
being decided by legislatures and the courts. For 
test developers and the public, the stakes are high: 
Businesses in the United States administer an esti- 
mated 5 million integrity tests each year. As of 
1994, Massachusetts was the only state to ban in- 
tegrity tests, but legislation was pending in at least 
six other states (Camara & Schneider, 1994). Most 
likely, the use of integrity tests will be increasingly
restricted in the years ahead.

The debate about integrity tests juxtaposes the 
legitimate interests of business against the individ- 
ual rights of workers. Certainly, businesses have a 
right not to hire thieves, drug addicts, and malcon- 
tents. But in pursuing this goal, what is the ultimate 
cost to society of asking millions of job applicants 
about past behaviors involving drugs, alcohol, 
criminal behavior, and other highly personal mat- 
ters? Hanson (1991) has asked rhetorically whether 
society is well served by the current balance of 
power—in which businesses can obtain proprietary 
information about who is seemingly worthy and 
who is not. It appears almost inevitable that Con- 
gress will enter the debate. In 1988, President Rea- 
gan signed into law the Employee Polygraph 
Protection Act, which effectively eliminated poly- 
graph testing in industry (see Topic 10B, Forensic 
Applications of Assessment). Perhaps in the years 
ahead we will see integrity testing sharply curtailed 
by an Employee Integrity Test Protection Act. 
Wanek (1999) provides an excellent review of the 
current state of integrity testing. 


STRUCTURE AND MEASUREMENT
OF PSYCHOMOTOR ABILITIES


Guion (1991) has noted the curious paradox that 
sensory, physical, and dexterity abilities are es- 
sential to success in many work situations, yet 
these abilities are almost never assessed as part of 
employee screening or selection. Not only is the 
lack of such testing unfortunate from the stand- 
points of efficiency and fairness in employee se- 
lection, it also raises concerns about litigation, as 
discussed later. In this section we briefly review 
relevant instruments and issues in the assessment 
of psychomotor abilities. Additional perspectives 
on this topic can be found in Hogan & Quigley 
(1994) and Blakley, Quinones, Crawford, & Jago 
(1994). 

A proper understanding of the sensorimotor 
requirements of specific jobs is important for 
several reasons. First, physically disabled appli- 
cants have a legitimate basis for litigation if they 
are denied employment because of psychomotor 
limitations that have no bearing upon job perfor- 
mance. Employers are therefore in the position of 
needing to document, by means of job analysis 
and/or validity studies, that minimum levels of 
psychomotor skill are needed for efficient job per- 
formance. Conversely, employees may sue the 
company if they are injured or develop health 
problems on a job for which they do not possess a 
necessary level of physical skill. As Guion (1998) 
notes, questions of psychomotor ability are too 
important to leave to the traditional medical 
screening examination. 














Taxonomy of Psychomotor Skills 


The measurement of psychomotor skills has re- 
ceived intense scrutiny over the years by Fleishman 
and his colleagues (Fleishman, 1975; Fleishman & 
Quaintance, 1984). Based upon numerous factor- 
analytic studies, these researchers have identified 
approximately 20 specific ability factors in the psychomotor
domain (Fleishman, 1975). Eleven of these factors are
designated as perceptual-motor abilities, whereas nine are
referred to as physical proficiency abilities (Table 11.9).
Fleishman has developed rating scales to help employers
determine the physical abilities required for specific jobs;
the interested reader may wish to consult Fleishman and
Mumford (1988) for details. We restrict our coverage here to
more traditional tests that can be used for assessment of the
psychomotor capacities of potential employees.


TABLE 11.9 A Taxonomy of Psychomotor Skills

Perceptual-Motor Abilities      Physical Proficiency Abilities

Control precision               Extent flexibility
Multilimb coordination          Dynamic flexibility
Response orientation            Static strength
Reaction time                   Dynamic strength
Speed of arm movement           Explosive strength
Rate control (timing)           Trunk strength
Manual dexterity                Gross body coordination
Finger dexterity                Equilibrium
Arm-hand steadiness             Stamina
Wrist-finger speed
Aiming

Source: Fleishman, E. A. (1975). Toward a taxonomy of human
performance. American Psychologist, 30, 1127-1149.

For some occupations—particularly the manual 
trades—tests of psychomotor abilities can be used 
for screening or placement. However, test users 
should demonstrate the relevance of the screening 
criteria to actual job performance (Hogan, 1991). 
Below, we review a sampling of such tests.


Employee Aptitude Survey 


Two of the ten subtests from the Employee Apti- 
tude Survey (EAS) can be used to assess per- 
ceptual-motor skills (Grimsley, Ruch, Warren, & 
Ford, 1994). The EAS is primarily a cognitive 
battery used for personnel selection in organizations.
However, Test 3 (Visual Pursuit) and Test 9 (Manual Speed and
Accuracy) capture several of the perceptual-motor skills
proposed by Fleishman. The Visual Pursuit subtest consists of
30 lines interwoven with other lines. The task is to quickly
trace each line from beginning to end, as in an electrical
circuit diagram. Each line ends on one of five letters, which
serve as the answer options. Both right (R) and wrong (W)
answers are used to compute the score (S) from the formula
S = R − W/4. The subtest has a five-minute time limit and
would appear to capture such crucial perceptual-motor
abilities as control precision, manual dexterity, and arm-hand
steadiness (Table 11.9). This subtest is recommended in
screening for positions such as draftsperson, design engineer,
and technician. The Manual Speed and Accuracy subtest consists
of placing a pencil dot in as many Os as possible in a
five-minute time limit. Dots marked must be within the circle,
and errors are heavily penalized. The formula for the corrected
score is S = R − (5 × W). This subtest measures aiming and
other perceptual-motor abilities and is recommended in
screening for positions such as clerical worker, machine
operator, and jobs that require precision or repetitive tasks.
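
The two corrected-score formulas can be expressed directly in code. The sketch below simply restates S = R − W/4 and S = R − (5 × W); the function names and the example counts of right and wrong answers are hypothetical and are not drawn from the EAS manual.

```python
def visual_pursuit_score(right, wrong):
    """Corrected score for Test 3 (Visual Pursuit): S = R - W/4."""
    return right - wrong / 4


def manual_speed_accuracy_score(right, wrong):
    """Corrected score for Test 9 (Manual Speed and Accuracy): S = R - (5 x W)."""
    return right - 5 * wrong


# Hypothetical examinee: the same number of errors costs far more on Test 9.
print(visual_pursuit_score(right=24, wrong=4))
print(manual_speed_accuracy_score(right=300, wrong=4))
```

With five answer options per line, subtracting one-fourth of the wrong answers matches the conventional correction for guessing, whereas the fivefold penalty on Test 9 reflects the heavy weight placed on accuracy.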

The Technical Report (Grimsley et al., 1994) 
reports good test-retest reliabilities for the subtests 
(mainly in the .80s) and provides normative data for 
80 occupational groups. Summary scores for more 
than 1,000 males employed by a manufacturing 
company are also provided. The brevity of the Vi- 
sual Pursuit and Manual Speed and Accuracy sub- 
tests (5-minute time limit each) makes them highly 
attractive choices for initial screening in technical 
and clerical positions that require precision in 
speeded tasks. For a recent review of the EAS, see 
Muchinsky (2001). 


Minnesota Rate of Manipulation Test 


The Minnesota Rate of Manipulation Test is a ven- 
erable, respected mainstay for the assessment of 
finger-hand-arm dexterity. This test has been used 
since the 1940s for employee screening in a wide 
variety of industrial settings. The test consists of a 
60-hole board with round, fitted blocks that are red 
on one side and yellow on the other. The five subtests
include Placing, Turning, Displacing, One
Hand Turning and Placing, and Two Hand Turning 
and Placing. For each subtest, examinees are in- 
structed to place, turn, or move the round blocks in 
specific ways as quickly as possible. Reference 
norms are provided for a sample of 11,000 young 
adults and 3,000 older adults. This test would appear 
to measure the Fleishman variables of speed of arm 
movement, manual dexterity, and finger dexterity. 


Purdue Pegboard 


Another test of motor skills widely used in preem- 
ployment screening is the Purdue Pegboard test. 
This test was devised at Purdue University in 1948 
as an aid in the selection of employees for various 
kinds of manual labor. The test measures dexterity 
for two types of activity: gross movement of hands, 
fingers, and arms; and fingertip dexterity needed in 
assembly tasks. This test would appear to assess a 
complex mixture of the perceptual-motor skills 
identified by Fleishman. A number of additional 
tests of perceptual-motor abilities are described 
briefly in Table 11.10. 

From a practical standpoint, psychomotor tests such as the
Purdue Pegboard can be very useful in establishing minimal
levels of performance for purposes of preemployment screening.
Nonetheless, their utility in employee selection and placement
is limited. A major problem is that psychomotor tests typically
show substantial practice effects. Moreover, these practice
effects are highly variable from one subject to the next. What
this means is that the reliability of psychomotor tests is
typically only modest: test-retest reliabilities rarely exceed
.80 and are often much lower, in the .50s and .60s.


TABLE 11.10 Tests of Perceptual-Motor Skills

Roeder Manipulative Aptitude Test: Sorting and assembling
nuts, bolts, and washers

Pennsylvania Bi-Manual Worksample: Assembling nuts and bolts

O’Connor Finger Dexterity Test: Hand placement of pins in
holes, as in assembly line work

O’Connor Tweezer Dexterity Test: Use of tweezers in placing
single pins in -inch diameter holes

Grooved Pegboard: Rotating and placing pegs in slots which
have random orientations

Stromberg Dexterity Test: Placing 54 round, colored discs
(red, yellow, blue) in a prescribed sequence

Note: These tests are available from the Lafayette Instrument
Company, among other sources.


WORK SAMPLE AND 
SITUATIONAL EXERCISES 


A work sample is a miniature replica of the job for 
which examinees have applied. Muchinsky (2003) 
points out that the I/O psychologist’s goal in devis- 
ing a work sample is “to take the content of a per- 
son’s job, shrink it down to a manageable time 
period, and let applicants demonstrate their ability 
in performing this replica of the job.” Guion (1998) 
has emphasized that work samples need not include 
every aspect of a job, but should focus upon the 
more difficult elements that effectively discrimi- 
nate strong from weak candidates. For example, a 
position as clerk-typist may also include making 
coffee and running errands for the boss. However, 
these are trivial tasks demanding so little skill that 
it would be pointless to include them in a work 
sample. A work sample should test important job 
domains, not the entire job universe. 

Campion (1972) devised an ingenious work 
sample for mechanics that illustrates the preceding 
point. Using the job analysis techniques discussed 
at the beginning of this topic, Campion determined 
that the essence of being a good mechanic was de- 
fined by successful use of tools, accuracy of work, 
and overall mechanical ability. With the help of 
skilled mechanics, he devised a work sample that 
incorporated these job aspects through typical tasks 
such as installing pulleys and repairing a gearbox. 
Points were assigned to component behaviors for 
each task. Example items and their corresponding 
weights were as follows: 


                                                   Scoring
                                                   Weights

Installing Pulleys and Belts

1. Checks key before installing against:
   ___ shaft                                          2
   ___ pulley                                         2
   ___ neither                                        0

Disassembling and Repairing a Gear Box

10. Removes old bearing with:
   ___ press and driver                               3
   ___ bearing puller                                 2
   ___ gear puller                                    1
   ___ other                                          0

Pressing a Bushing into Sprocket and Reaming to Fit a Shaft

4. Checks internal diameter of bushing against shaft diameter:
   ___ visually                                       1
   ___ hole gauge and micrometers                     3
   ___ Vernier calipers                               2
   ___ scale                                          1
   ___ does not check                                 0
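
The weighted checklist above lends itself to a simple scoring routine. The following is a minimal sketch rather than Campion's actual procedure: the weights are transcribed from the example items, but the item labels, the function name, the sample candidate, and the rule of summing every behavior the observer checks are assumptions made for illustration.

```python
# Scoring weights transcribed from the example items above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "checks_key": {"shaft": 2, "pulley": 2, "neither": 0},
    "removes_old_bearing": {
        "press and driver": 3,
        "bearing puller": 2,
        "gear puller": 1,
        "other": 0,
    },
    "checks_bushing_diameter": {
        "visually": 1,
        "hole gauge and micrometers": 3,
        "Vernier calipers": 2,
        "scale": 1,
        "does not check": 0,
    },
}


def score_work_sample(observed):
    """Sum the weights of every behavior the observer checked.

    observed maps an item label to the list of options checked for that item.
    Summing all checked options is an assumption of this sketch.
    """
    total = 0
    for item, options in observed.items():
        for option in options:
            total += WEIGHTS[item][option]
    return total


# Hypothetical candidate observed on the three example items.
candidate = {
    "checks_key": ["shaft", "pulley"],
    "removes_old_bearing": ["bearing puller"],
    "checks_bushing_diameter": ["hole gauge and micrometers"],
}
print(score_work_sample(candidate))
```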


Campion found that the performance of 34 male 
maintenance mechanics on the work sample mea- 
sure was significantly and positively related to the 
supervisor’s evaluations of their work perfor- 
mance, with validity coefficients ranging from .42 
to .66. 

A situational exercise is approximately the 
white-collar equivalent of a work sample. Situa- 
tional exercises are largely used to select persons 
for managerial and professional positions. The 
main difference between a situational exercise and 
a work sample is that the former mirrors only part 
of the job, whereas the latter is a microcosm of the 
entire job (Muchinsky, 1990). In a situational exer- 
cise, the prospective employee is asked to perform 
under circumstances that are highly similar to the 
anticipated work environment. Measures of ac- 
complishment can then be gathered as a basis for 
gauging likely productivity or other aspects of job 
effectiveness. The situational exercises with the 
highest validity show a close resemblance to
the criterion; that is, the best exercises are highly 
realistic (Asher & Sciarrino, 1974; Muchinsky, 
2003). 

Work samples and situational exercises are 
based on the conventional wisdom that the best pre- 
dictor of future performance in a specific domain is 
past performance in that same domain. Typically, a 
situational exercise requires the candidate to 
perform in a setting that is highly similar to the 
intended work environment. Thus, the resulting 
performance measures resemble those that make up 
the prospective job itself. 

Hundreds of work samples and situational ex- 
ercises have been proposed over the years. For 
example, in an earlier review, Asher and Sciarrino 
(1974) identified 60 procedures, including the 
following: 


• Typing test for office personnel
• Mechanical assembly test for loom fixers
• Map-reading test for traffic control officers
• Tool dexterity test for machinists and riveters
• Headline, layout, and story organization test for magazine editors
• Oral fact-finding test for communication consultants
• Role-playing test for telephone salespersons
• Business-letter-writing test for managers


A very effective situational exercise that we will 
discuss here is the in-basket technique, a proce- 
dure that simulates the work environment of an 
administrator. 


The In-Basket Test 


The classic paper on the in-basket test is the mono- 
graph by Frederiksen (1962). For this comprehen- 
sive study Frederiksen devised the Bureau of 
Business In-Basket Test, which consists of the let- 
ters, memoranda, records of telephone calls, and 
other documents that have collected in the in-basket 
of a newly hired executive officer of a business bureau.
In this test, the candidate is instructed not to play a role,
but to be himself.² The candidate is not to say what he would
do; he is to do it.

The letters, memoranda, phone calls, and inter- 
views completed by him in this simulated job envi- 
ronment constitute the record of behavior that is 
scored according to both content and style of the 
responses. Response style refers to how a task was 
completed—courteously, by telephone, by involv- 
ing a superior, through delegation to a subordinate, 
and so on. Content refers to what was done, in- 
cluding making plans, setting deadlines, seeking 
information; several quantitative indices were also 
computed, including number of items attempted 
and total words written. For some scoring criteria 
such as imaginativeness—the number of courses of 
action which seemed to be good ideas—expert 
judgment was required. 

Frederiksen (1962) administered his in-basket test 
to 335 subjects, including students, administrators, 
executives, and army officers. Scoring the test was a 
complex procedure that required the development of 
a 165-page manual. The odd-even reliability of the in- 
dividual items varied considerably, but enough mod- 
estly reliable items emerged (rs of .70 and above) that 
Frederiksen could conduct several factor analyses and 
also make meaningful group comparisons. 
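
For readers unfamiliar with odd-even reliability, the following sketch shows the general computation: scores on the odd-numbered and even-numbered items are summed separately for each examinee, and the two half scores are correlated. The data and function names are hypothetical, and the Spearman-Brown step-up is a conventional adjustment that we include as an assumption; it is not described in the account of Frederiksen's study given here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(item_scores):
    """Correlate odd-item totals with even-item totals across examinees.

    item_scores holds one list of item scores per examinee.
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(odd, even)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical scores for five examinees on six items.
scores = [
    [3, 2, 4, 3, 2, 3],
    [1, 1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4, 3],
    [2, 3, 2, 2, 3, 2],
    [3, 3, 3, 4, 3, 4],
]
r_half = odd_even_reliability(scores)
print(round(r_half, 2), round(spearman_brown(r_half), 2))
```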

When scores on the individual items were cor- 
related with each other and then factor analyzed, the 
behavior of potential administrators could be de- 
scribed in terms of eight primary factors. When 
scores on these primary factors were themselves fac- 
tor analyzed, three second-order factors emerged. 
These second-order factors describe administrative 
behavior in the most general terms possible. The 
first dimension is Preparing for Action, character- 
ized by deferring final decisions until information 
and advice is obtained. The second dimension is 
simply Amount of Work, depicting the large indi- 
vidual differences in the sheer work output. The 
third major dimension is called Seeking Guidance, with high
scorers appearing to be anxious and indecisive. These
dimensions fit well with existing theory about administrator
performance and therefore support the validity of
Frederiksen’s task.


2. We do not mean to promote a subtle sexism here, but in fact
Frederiksen (1962) tested a predominantly (if not exclusively)
male sample of students, administrators, executives, and army
officers.

A number of salient attributes emerged when 
Frederiksen compared the subject groups on the 
scorable dimensions of the in-basket test. For 
example, the undergraduates stressed verbal pro- 
ductivity, the government administrators lacked 
concern with outsiders, the business executives 
were highly courteous, the army officers exhibited 
strong control over subordinates, and school prin- 
cipals lacked firm control. These group differences 
speak strongly to the construct validity of the in- 
basket test, since the findings are consistent with 
theoretical expectations about these subject groups. 

Early studies supported the predictive validity of 
in-basket tests. For example, Brass and Oldham 
(1976) demonstrated that performance on an in- 
basket test corresponded to on-the-job performance 
of supervisors if the appropriate in-basket scoring 
categories are used. Specifically, based on the in- 
basket test, supervisors who personally reward 
employees for good work, personally punish subor- 
dinates for poor work, set specific performance ob- 
jectives, and enrich their subordinates’ jobs are also 
rated by their superiors as being effective managers. 
The predictive power of these in-basket dimensions 
was significant, with a multiple correlation coeffi- 
cient of .54 between predictors and criterion. Stan- 
dardized in-basket tests can now be purchased for use 
by private organizations. Unfortunately, most of 
these tests are “in-house” instruments not available 
for general review. In spite of an occasional caution- 
ary review (e.g., Brannick et al., 1989), the in-basket 
technique is still highly regarded as a useful method 
of evaluating candidates for managerial positions. 


Assessment Centers 


An assessment center is not so much a place as a pro- 
cess. Many corporations and military branches—as 
well as a few progressive governments—have dedi- 
cated special sites to the application of in-basket and 
other simulation exercises in the training and selection
of managers. The purpose of an assessment center
is to evaluate managerial potential by exposing
candidates to multiple simulation techniques, includ- 
ing group presentations, problem-solving exercises, 
group discussion exercises, interviews, and in-basket 
techniques. Results from traditional aptitude and per- 
sonality tests also are considered in the overall evalu- 
ation. The various simulation exercises are observed 
and evaluated by successful senior managers who 
have been specially trained in techniques of observa- 
tion and evaluation. Assessment centers are used in a 
variety of settings, including business and industry, 
government, and the military. There is no doubt that a 
properly designed assessment center can provide a 
valid evaluation of managerial potential. Follow-up 
research has demonstrated that the performance of 
candidates at an assessment center is strongly corre- 
lated with supervisor ratings of job performance 
(Gifford, 1991). A more difficult question to answer 
is whether assessment centers are cost-effective in 
comparison to traditional selection procedures. After 
all, funding an assessment center is very expensive. 
The key question is whether the assessment center 
approach to selection boosts organizational produc- 
tivity sufficiently to offset the expense of the selec- 
tion process. Anecdotally, the answer would appear 
to be a resounding yes, since poor decisions from bad 
managers can be very expensive. However, there is 
little empirical data that addresses this issue. 

Goffin, Rothstein, and Johnston (1996) com- 
pared the validity of traditional personality testing 
(with the Personality Research Form; Jackson, 
1984b) and the assessment center approach in the 
prediction of the managerial performance of 68 man- 
agers in a forestry products company. Both methods 
were equivalent in predicting performance, which 
would suggest that the assessment center approach 
is not worth the (very substantial) additional cost. 
However, when both methods were used in combi- 
nation, personality testing provided significant in- 
cremental validity over that of the assessment center 
alone. Thus, personality testing and assessment cen- 
ter findings each contribute unique information help- 
ful in predicting performance. Case Exhibit 11.1 
illustrates an assessment center in action. 
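
The notion of incremental validity can be illustrated with the standard formula for the squared multiple correlation of a criterion with two predictors. The sketch below is a worked example under stated assumptions: the three correlations are hypothetical values chosen for illustration, not the coefficients reported by Goffin, Rothstein, and Johnston (1996), and the function names are our own.

```python
def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors.

    r_y1, r_y2 -- validity of each predictor against the criterion
    r_12       -- correlation between the two predictors
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_validity(r_y1, r_y2, r_12):
    """Gain in explained criterion variance from adding predictor 2 to predictor 1."""
    return multiple_r_squared(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical values: two equally valid predictors that overlap only slightly.
assessment_center_r = 0.40  # assumed validity of assessment center ratings
personality_r = 0.40        # assumed validity of the personality test
overlap_r = 0.20            # assumed correlation between the two predictors

print(round(multiple_r_squared(assessment_center_r, personality_r, overlap_r), 3))
print(round(incremental_validity(assessment_center_r, personality_r, overlap_r), 3))
```

When two predictors are equally valid but only weakly correlated with each other, each adds explained variance beyond the other, which is the pattern described above.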


SUMMARY


1. Industrial and organizational psychology 
(I/O psychology) deals with behavior in work situations
(business, advertising, and the military). I/O
psychologists use psychological testing and as- 
sessment for diverse purposes, including hiring, 
placement, promotion, and evaluation. 


2. Job analysis consists of defining a job in 
terms of the behaviors necessary to perform it. Job 
analysis includes two major components: job de- 
scription (physical and environmental characteris- 
tics of the work) and job specification (personal 
characteristics required). 


3. The Position Analysis Questionnaire 
(PAQ) and similar instruments provide quantifiable 
information pertinent to job analysis. For example, 
the PAQ assesses the following components of a 
job: information input, mental processes, work out- 
put, personal relationships, and job context. 


4. Psychological tests may play a major role 
in personnel selection, but they must be used with 
sensitivity to issues of predictive validity and legal 
concerns. I/O psychologists need to recognize that 
even an optimal selection approach may not be 
valid for all candidates. 


5. Autobiographical data, known as biodata, 
possess substantial predictive validity for many 
kinds of personnel selection. In many studies the 
predictive validity of biodata (with values in the 
.50s) rivals that of standardized tests. 


6. In the form in which it is typically used for 
personnel selection, the interview has low reliability 
and poor validity. Only when the interview is care- 
fully designed and highly structured can it provide a 
reliable and valid basis for personnel selection. 


7. Cognitive ability tests provide a good basis 
for personnel selection in most occupations. Ri- 
valled only by the work sample, ability tests have a 
validity coefficient of .54 averaged over many tests 
and many samples. 


8. Cognitive tests that measure general ability 
(g) often predict job performance better than measures
of specific abilities. The reason is that most
jobs are factorially complex in their requirements, 
which ensures that measures of g will possess high 
predictive validity. 

9. When validated for the intended use, per- 
sonality and temperament tests may provide a use- 
ful basis for employee selection. For example, the 
Hogan Personality Inventory (HPI) is well vali- 
dated for prediction of job performance in military, 
hospital, and corporate settings. 


10. Paper-and-pencil integrity tests are de- 
signed to screen theft-prone individuals and other 
undesirable job candidates. Some of these instru- 
ments possess moderate predictive validity (e.g., 
personality-based measures), but their use raises 
many ethical concerns. 


11. The assessment of psychomotor and sen- 
sorimotor skills can be important for some occu- 
pations. For example, the Minnesota Rate of 
Manipulation Test provides a measure of finger- 
hand-arm dexterity useful for employee screening 
in a variety of industrial settings. 


12. A work sample is a miniature replica of the 
job for which examinees have applied. A properly 
designed work sample (e.g., prospective mechan- 
ics might be asked to install a pulley and repair a 
gearbox) can yield validity coefficients in the .40s, 
.50s, or .60s. 


13. Situational exercises such as the in-basket 
test are used mainly to select persons for manage- 
rial and professional positions. Although very time- 
consuming and expensive, situational exercises 
provide a valid basis for selection of managers. 


14. An assessment center is used to evaluate 
managerial potential by exposing candidates to mul- 
tiple simulation techniques, including group pre- 
sentations, problem-solving exercises, interviews, 
and in-basket techniques. Assessment center ratings 
help identify high-level managerial talent. 


KEY TERMS AND CONCEPTS


TOPIC 11B Appraisal of Work Performance


Functions of Performance Appraisal 

Approaches to Performance Appraisal 

Sources of Error in Performance Appraisal 

Legal Issues in I/O Assessment 

Case Exhibit 11.2 Unwise Testing Practices in Employee Screening 


Summary 
Key Terms and Concepts 


The appraisal of work performance is crucial to the
successful operation of any business or organization. In the
absence of meaningful feedback, employees have no idea how to
improve. In
the absence of useful assessment, administrators 
have no idea how to manage personnel. It is diffi- 
cult to imagine how a corporation, business, or or- 
ganization could pursue an institutional mission 
without evaluating the performance of its employ- 
ees in one manner or another. 

Industrial and organizational psychologists fre- 
quently help devise rating scales and other instru- 
ments used for performance appraisal (Landy & 
Farr, 1983). When done properly, employee evalu- 
ation rests upon a solid foundation of applied psy- 
chological measurement—hence its inclusion as a 
major topic in this text. In addition to introducing 
essential issues in the measurement of work per- 
formance, we also touch briefly upon the many 
legal issues that surround the selection and ap- 
praisal of personnel. We begin by discussing the 
context of performance appraisal. 


FUNCTIONS OF 
PERFORMANCE APPRAISAL 


The evaluation of work performance serves many 
organizational purposes. The short list includes 
promotions, transfers, layoffs, and the setting of
salaries, all of which may hang in the balance of
performance appraisal. The long list includes the 
20 common uses identified by Cleveland, Murphy, 
and Williams (1989): 


Salary administration
Promotion
Retention or termination
Recognition of individual performance
Layoffs
Identify poor performance
Identify individual training needs
Performance feedback
Determine transfers and assignments
Identify individual strengths and weaknesses
Personnel planning
Determine organizational training needs
Evaluate goal achievement
Assist in goal identification
Evaluate personnel systems
Reinforce authority structure
Identify organizational development needs
Criteria for validation research
Document personnel decisions
Meet legal requirements


These applications of performance evaluation cluster
around four major uses: comparing individuals in terms of
their overall performance levels; identifying and using
information about individual strengths and weaknesses;
implementing and evaluating human resource systems in
organizations; and
documenting or justifying personnel decisions. Be- 
yond a doubt, performance evaluation is essential 
to the maintenance of organizational effectiveness. 

As the reader will soon discover, performance 
evaluation is a perplexing problem for which the 
simple and obvious solutions are usually incorrect. 
In part, the task is difficult because the criteria for 
effective performance are seldom so straightfor- 
ward as “dollar amount of widgets sold” (e.g., for 
a salesperson) or “percentage of students passing a 
national test” (e.g., for a teacher). As much as we 
might prefer objective methods for assessing the ef- 
fectiveness of employees, judgmental approaches 
are often the only practical choice for performance 
evaluation. 

The problems encountered in the implementa- 
tion of performance evaluation are usually referred 
to collectively as the criterion problem—a desig- 
nation that first appeared in the 1950s (e.g., Flana- 
gan, 1956; Landy & Farr, 1983). The phrase 
criterion problem is meant to convey the difficul- 
ties involved in conceptualizing and measuring per- 
formance constructs, which are often complex, 
fuzzy, and multidimensional. For a thorough dis- 
cussion of the criterion problem, the reader should 
consult comprehensive reviews by Austin and Vil- 
lanova (1992) and Campbell, Gasser, and Oswald 
(1996). We touch upon some aspects of the crite- 
rion problem in the following review. 


APPROACHES TO
PERFORMANCE APPRAISAL


There are literally dozens of conceptually distinct 
approaches to the evaluation of work performance. 
In practice, these numerous approaches break down 
into four classes of information: performance mea- 
sures such as productivity counts; personnel data 
such as rate of absenteeism; peer ratings and self- 
assessments; and supervisor evaluations such as 
rating scales. Rating scales completed by supervi- 
sors are by far the preferred method of performance 
appraisal, as discussed later. First, we mention the 
other approaches briefly. 


Performance Measures


Performance measures include seemingly objective 
indices such as number of bricks laid for a mason, 
total profit for a salesperson, or percentage of stu- 
dents graduated for a teacher. Although production 
counts would seem to be the most objective and 
valid methods for criterion measurement, there are 
serious problems with this approach (Guion, 1965). 
The problems include the following: 


• The rate of productivity may not be under the control of
  the worker. For example, the fast-food worker can only sell
  what people order, and the assembly line worker can only
  proceed at the same pace as coworkers.

• Production counts are not applicable to most jobs. For
  example, relevant production units do not exist for a
  college professor, a judge, or a hotel clerk.

• An emphasis upon production counts may distort the quality
  of the output. For example, pharmacists in a mail-order drug
  emporium may fill prescriptions with the wrong medicine if
  their work is evaluated solely upon productivity.


Another problem is that production counts may be 
unreliable, especially over short periods of time. 
Finally, production counts may tap only a small 
proportion of job requirements, even when they 
appear to be the definitive criterion. For example, 
sales volume would appear to be the ideal criterion 
for most sales positions. Yet, a salesperson can 
boost sales by misrepresenting company products. 
Sales may be quite high for several years—until 
the company is sued by unhappy customers. Pro- 
ductivity is certainly important in this example, 
but the corporation should also desire to assess 
interpersonal factors such as honesty in customer 
relations. 


Personnel Data: Absenteeism 


Personnel data such as rate of absenteeism pro- 
vide another possible basis for performance 
evaluation. Certainly employers have good reason 
to keep tabs on absenteeism and to reduce it 
through appropriate incentives. Steers and Rhodes
(1978) calculated that absenteeism costs about $25 
billion a year in lost productivity! Little wonder 
that absenteeism is a seductive criterion measure 
that has been researched extensively (Harrison & 
Hulin, 1989). 

Unfortunately, absenteeism turns out to be a 
largely useless measure of work performance, ex- 
cept for the extreme cases of flagrant work truancy. 
A major problem is defining absenteeism. Landy 
and Farr (1983) list 28 categories of absenteeism, 
many of which are uncorrelated with the others. 
Different kinds of absenteeism include scheduled 
versus unscheduled, authorized versus unautho- 
rized, justified versus unjustified, contractual ver- 
sus noncontractual, sickness versus nonsickness, 
medical versus personal, voluntary versus involun- 
tary, explained versus unexplained, compensable 
versus noncompensable, certified illness versus 
casual illness, Monday/Friday absence versus mid- 
week, and reported versus unreported. When is a 
worker truly absent from work? The criteria are 
very slippery. 

In addition, absenteeism turns out to be an 
atrociously unreliable variable. The test-retest 
correlations (absentee rates from two periods of 
identical length) are as low as .20, meaning that 
employees display highly variable rates of absen- 
teeism from one time period to the next. A related 
problem with absenteeism is that workers tend to 
underreport it for themselves and overreport it 
for others (Harrison & Shaffer, 1994). Finally, for 
the vast majority of workers, absenteeism rates 
are quite low. In short, absenteeism is a poor 
method for assessing worker performance, except 
for the small percentage of workers who are chron- 
ically truant. 
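
The test-retest idea mentioned above can be made concrete with a small sketch. The absence counts below are invented purely for illustration (they are not data from any cited study), and the helper name pearson_r is an assumption of this example. The sketch correlates absence counts for the same workers across two periods of identical length.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical days absent for the same eight workers in two periods
# of identical length (made-up numbers for illustration only).
period_1 = [0, 1, 0, 2, 5, 1, 0, 3]
period_2 = [1, 0, 2, 0, 4, 0, 3, 1]

r = pearson_r(period_1, period_2)
print(f"test-retest correlation: {r:.2f}")
```

A low coefficient of this kind means that a worker's absence rate in one period says little about the same worker's rate in the next period.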


Peer Ratings and Self-Assessments 


Some researchers have proposed that peer ratings 
and self-assessments are highly valid and constitute 
an important complement to supervisor ratings. A 
substantial body of research pertains to this ques- 
tion, but the results are often confusing and contra- 
dictory. Nonetheless, it is possible to list several 


generalizations (Harris & Schaubroeck, 1988; 
Smither, 1994): 


• Peers give more lenient ratings than supervisors. 

• The correlation between self-ratings and supervisor 
ratings is minimal. 

• The correlation between peer ratings and supervisor 
ratings is moderate. 

• Supervisors and subordinates have different 
ideas about what is important in jobs. 


Overall, reviewers conclude that peer ratings and 
self-assessments may have limited application for 
purposes such as personal development, but their 
validity is not yet sufficiently established to justify 
widespread use (Smither, 1994). 


Supervisor Rating Scales 


Rating scales are the most common measure of job 
performance (Landy & Farr, 1983; Muchinsky, 
2003). These instruments vary from simple graphic 
forms to complex scales anchored to concrete 
behaviors. In general, supervisor rating scales reveal 
only fair reliability, with a mean interrater relia- 
bility coefficient of .52 across many different 
approaches and studies (Viswesvaran, Ones, & 
Schmidt, 1996). In spite of their weak reliability, 
supervisor ratings still rank as the most widely used 
approach. About three-quarters of all performance 
evaluations are based upon judgmental methods 
such as supervisor rating scales (Landy, 1985). 

The simplest rating scale is the graphic rating 
scale, introduced by Donald Paterson in 1922 
(Landy & Farr, 1983). A graphic rating scale con- 
sists of trait labels, brief definitions of those labels, 
and a continuum for the rating. As the reader will 
notice in Figure 11.1, several types of graphic 
rating scales have been used. 

The popularity of graphic rating scales is 
due, in part, to their simplicity. But this is also a 
central weakness because the dimension of work 
performance being evaluated may be vaguely de- 
fined. Dissatisfaction with graphic rating scales 
led to the development of many alternative ap- 
proaches to performance appraisal, as discussed in 
this section. 

FIGURE 11.1 Examples of Graphic Rating Scales 


A critical incidents checklist is based upon 
actual episodes of desirable and undesirable 
on-the-job behavior (Flanagan, 1954). Typically, a 
checklist developer will ask employees to help 
construct the instrument by submitting specific 
examples of desirable and undesirable job behavior. For 
example, suppose that we intended to develop a 
checklist to appraise the performance of resident 
advisers (RAs) in a dormitory. Modeling a study by 
Aamodt, Keller, Crawford, and Kimbrough (1981), 
we might ask current dormitory RAs the following 
question: 


Think of the best RA that you have ever known. 
Please describe in detail several incidents 
that reflect why this person was the best ad- 
viser. Please do the same for the worst RA 
you have ever known. 


Based upon hundreds of nominated behaviors, 
checklist developers would then proceed to distill 
and codify these incidents into a smaller number of 
relevant behaviors, both desirable and undesirable. 
For example, the following items might qualify for 
the RA checklist: 


____ stays in dorm more than required 
____ breaks dormitory rules 
____ is fair about discipline 
____ plans special programs 
____ fails to discipline friends 
____ is often unfriendly 
____ shows concern about residents 
____ comes across as authoritarian 


Of course, the full checklist would be much longer 
than the preceding. The RA supervisor would com- 
plete this instrument as a basis for performance ap- 
praisal. If needed, an overall summary score can be 
derived from an appropriate weighting of individ- 
ual items. Harvey (1991) discusses the advantages 
and disadvantages of this approach. 
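
As a rough illustration of how a summary score might be derived from weighted checklist items, consider the sketch below. The weights (+1 for desirable behaviors, -1 for undesirable ones) and the function name checklist_score are assumptions made for this example; an actual instrument would use empirically derived weights.

```python
# Hypothetical item weights for the sample RA checklist:
# +1 for desirable behaviors, -1 for undesirable ones (assumed values).
ITEM_WEIGHTS = {
    "stays in dorm more than required": 1,
    "breaks dormitory rules": -1,
    "is fair about discipline": 1,
    "plans special programs": 1,
    "fails to discipline friends": -1,
    "is often unfriendly": -1,
    "shows concern about residents": 1,
    "comes across as authoritarian": -1,
}


def checklist_score(checked_items, weights=ITEM_WEIGHTS):
    """Sum the weights of the items the supervisor checked for one RA."""
    checked = set(checked_items)  # each item counts at most once
    unknown = checked - set(weights)
    if unknown:
        raise ValueError(f"items not on the checklist: {sorted(unknown)}")
    return sum(weights[item] for item in checked)


# A supervisor checks two desirable behaviors and one undesirable behavior.
checked = [
    "is fair about discipline",
    "plans special programs",
    "is often unfriendly",
]
total = checklist_score(checked)
print(total)
```

With unit weights the score is simply the number of desirable behaviors checked minus the number of undesirable ones; unequal weights would let some incidents count more heavily than others.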

Another form of criterion-referenced judgmen- 
tal measure is the behaviorally anchored rating 
scale (BARS). The classic work on BARS dates 
back to Smith and Kendall (1963). These authors 
proposed a complex developmental procedure for 
producing criterion-referenced judgments. The 
procedure uses a number of experts to identify and 
define performance dimensions, generate behavior 
examples, and scale the behaviors meaningfully. 
Overall, the procedure is quite complex, time- 
consuming, and expensive. A number of variations 
and improvements have been suggested (Harvey, 


1991). An advantage to BARS and other behavior- 
based scales is their strict adherence to EEOC 
(Equal Employment Opportunity Commission) 
guidelines discussed later in this chapter. BARS 
and related approaches focus upon behaviors as op- 
posed to personality or attitudinal characteristics. 
A behaviorally anchored scale for job performance 
of a sales supervisor is depicted in Figure 11.2. Of 
course, the comprehensive evaluation of a sales 
manager would include additional scales for other 
aspects of work. 

Research on improving the accuracy of ratings 
with BARS is mixed. Some studies find fewer rat- 
ing errors—especially a reduction in unwarranted 
leniency of evaluations—whereas other studies re- 
port no improvement with BARS compared to 
other evaluation methods (Murphy & Pardaffy, 
1989). Overall, Muchinsky (2003) concludes that 
the BARS approach is not much better than graphic 
rating scales in reducing rating errors. Nonetheless, 
the scale development process of BARS may have 
indirect benefits in that supervisors are compelled 
to pay close attention to the behavioral components 
of effective performance. 

A behavior observation scale (BOS) is a vari- 
ation upon the BARS technique. The difference be- 
tween the two is that the BOS approach uses a 
continuum from “almost never” to “almost always” 
to measure how often an employee performs the 
specific tasks on each behavioral dimension. As 
with the BARS technique, researchers question 
whether behavior observation scales are worth the 
extra effort (Guion, 1998). 

A forced-choice scale is designed to eliminate 
bias and subjectivity in supervisor ratings by forc- 
ing a choice between options that are equal in so- 
cial desirability. In theory, this approach makes it 
impossible for the supervisor to slant ratings in a 
biased or subjective manner. We will use the path- 
breaking research by Sisson (1948) to illustrate the 
features of this approach. He developed a scale to 
evaluate Army officers that consisted of tetrads of 
behavioral descriptors. Each tetrad contained two 
positive items matched for social desirability and 
two negative items also matched for social 
desirability. The four items in each tetrad were topically 

Could be expected to give his sales personnel 
confidence and a strong sense of responsibility 
by delegating many important jobs to them. 

Could be expected to exhibit courtesy and respect 
toward his sales personnel. 

Could be expected to be rather critical of store 
standards in front of his own people, thereby 
risking their developing poor attitudes. 

Could be expected to go back on a promise to 
an individual whom he had told could transfer 
back into previous department if she/he didn’t 
like the new one. 

Could be expected to conduct a full day’s sales 
clinic with two new sales personnel and thereby 
develop them into top sales people in the 
department. 

Could be expected never to fail to conduct 
training meetings with his people weekly at a 
scheduled hour and to convey to them exactly 
what he expects. 

Could be expected to remind sales personnel to 
wait on customers instead of conversing with 
each other. 

Could be expected to tell an individual to come 
in anyway even though she/he called in to say 
she/he was ill. 

Could be expected to make promises to an 
individual about her/his salary being based 
on department sales even when he knew such 
a practice was against company policy. 

FIGURE 11.2 Behaviorally Anchored Rating Scale for Sales Supervisor 
Source: Reprinted with permission from Campbell, J. P., Dunnette, M. D., Arvey, R. D., & Hellervik, L. V. (1973). The development and 
evaluation of behaviorally based rating scales. Journal of Applied Psychology, 57, 15-22. 


related to a single performance dimension. Un- 
known to the supervisors who completed the rating 
scale, one of the two positive items was judged very 
descriptive of effective Army officers and the other 


judged less so. Likewise, one of the two negative 
items was judged more descriptive of ineffective 
Army officers and the other judged less so. Here is 
a sample tetrad (Borman, 1991): 

                                   Most          Least 
                                   Descriptive   Descriptive 
A. Cannot assume responsibility    ____          ____ 
B. Knows how and when to 
   delegate authority              ____          ____ 
C. Offers suggestions              ____          ____ 
D. Changes ideas too easily        ____          ____ 











Supervisors were asked to review the items in each 
tetrad and to check one item as most descriptive 
and one item as least descriptive of the officer 
being evaluated. A score of +1 was awarded for 
responding “most descriptive” to the positively 
keyed item (in this case, alternative B) or “least 
descriptive” to the negatively keyed item (in this case, 
alternative A), whereas a score of -1 was awarded 
for responding “least descriptive” to the positively 
keyed item or “most descriptive” to the negatively 
keyed item. Responding to the nonkeyed items 
(alternatives C and D) as most or least descriptive 
earned a score of 0. Because each tetrad received one 
“most” and one “least” response, each tetrad yielded a 
five-point continuum of scores: +2, +1, 0, -1, -2. 
The summary score used for performance ap- 
praisal consisted of the algebraic sum of the indi- 
vidual items. 
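
The scoring rule just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Sisson's (1948) procedure: the function names and the sample responses are hypothetical, and the keying (B positive, A negative) follows the sample tetrad.

```python
def score_tetrad(most, least, positive_key, negative_key):
    """Score one tetrad from the rater's 'most' and 'least' descriptive picks.

    +1 for 'most' on the positively keyed item or 'least' on the
    negatively keyed item; -1 for the reverse; 0 for nonkeyed items.
    """
    if most == least:
        raise ValueError("'most' and 'least' must be different items")
    score = 0
    if most == positive_key:
        score += 1
    elif most == negative_key:
        score -= 1
    if least == negative_key:
        score += 1
    elif least == positive_key:
        score -= 1
    return score


def summary_score(responses, keys):
    """Algebraic sum of tetrad scores across the whole scale."""
    return sum(
        score_tetrad(most, least, pos, neg)
        for (most, least), (pos, neg) in zip(responses, keys)
    )


# Sample tetrad keying from the text: B is positively keyed, A negatively keyed.
print(score_tetrad("B", "A", positive_key="B", negative_key="A"))  # both keyed picks favorable
print(score_tetrad("C", "D", positive_key="B", negative_key="A"))  # only nonkeyed items chosen
print(score_tetrad("A", "B", positive_key="B", negative_key="A"))  # both keyed picks unfavorable
```

Because the rater cannot tell which of the two equally desirable items is keyed, the resulting score is difficult to slant deliberately.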

The forced-choice approach has never really 
caught on, due largely to the effort required in scale 
construction. This is unfortunate, because the 
method does effectively reduce unwanted bias. 
Borman (1991) refers to this approach as a “bold 
initiative” that produces a relatively objective rating 
scale. 


SOURCES OF ERROR IN 
PERFORMANCE APPRAISAL 


The most difficult problem in the assessment of job 
performance is the proper definition of appraisal 
criteria. If the supervisor is using a poorly designed 
instrument that does not tap the appropriate di- 
mensions of job behavior, then almost by definition 
the performance appraisal will be inaccurate, in- 
complete, and erroneous. Undoubtedly, the failure 
to identify appropriate criteria for acceptable and 
unacceptable performance is a major source of 


error in performance appraisal. But it is not the only 
source. Even when supervisors have access to ex- 
cellent, well-designed measures of performance 
appraisal, various sorts of subtle errors can creep 
in. We discuss three such additional sources of rat- 
ing error: halo effect, rater bias, and criterion con- 
tamination. 


Halo Effect 


The tendency to rate an employee high or low on 
all dimensions because of a global impression is 
called halo effect. Research on the halo effect 
can be traced back to the early part of the twentieth century 
(Thorndike, 1920). The most common halo effect 
is a positive halo effect. In this case, an employee 
receives a higher rating than deserved because the 
supervisor fails to be objective when rating specific 
aspects of the employee’s behavior. A positive 
halo effect is usually based upon overgenerali- 
zation from one element of a worker’s behavior. 
For example, an employee with perfect attendance 
may receive higher-than-deserved evaluations 
on productivity and work quality—even though 
attendance is not directly related to these job 
dimensions. 

Smither (1998) lists the following approaches 
to control for halo effects: 


• Provide special training for raters 

• Supervise the supervisors during the rating 

• Practice simulations before doing the ratings 

• Keep a diary of information relevant to appraisal 

• Provide supervisors with a short lecture on halo 
effects 


Additional approaches to rater training are dis- 
cussed by Goldstein (1991). An intriguing analysis 
of the nature and consequences of halo error can be 
found in Murphy, Jako, and Anhalt (1993). Con- 
trary to the reigning prejudice against halo errors, 
these researchers conclude that the halo effect does 
not necessarily detract from the accuracy of ratings. 
They point out that a presumed halo effect is often 
the by-product of true overlap on the dimensions 
being rated. The debate over halo effect is not likely 
to be resolved any time soon (Arvey & Murphy, 
1998). 


Rater Bias 


The potential sources of rater bias are so numer- 
ous that we can only mention a few prominent ex- 
amples here. Leniency or severity errors occur 
when a supervisor tends to rate workers at the ex- 
tremes of the scale. Leniency may reflect social dy- 
namics, as when the supervisor wants to be liked by 
employees. Leniency is also caused by extraneous 
factors such as the attractiveness of the employee. 
Severity errors refer to the practice of rating all as- 
pects of performance as deficient. In contrast, cen- 
tral tendency errors occur when the supervisor rates 
everyone as nearly average on all performance 
dimensions. Context errors occur when the rater 
evaluates an employee in the context of other em- 
ployees rather than based upon objective perfor- 
mance. For example, the presence of a workaholic 
salesperson with extremely high sales volume 
might cause the sales supervisor to rate other sales 
personnel lower than deserved. 

Recently, researchers have paid considerable 
attention to the possible biasing effects of whether 
a supervisor likes or dislikes a subordinate. Sur- 
prisingly, the trend of the findings is that supervi- 
sor affect (liking or disliking) toward specific 
employees does not introduce rating bias. In gen- 
eral, strong affect in either direction represents 
valid information about an employee. Thus, ratings 
of affect often correlate strongly with performance 
ratings, but this is because both are a consequence 
of how well or poorly the employee does the job 
(Ferris, Judge, Rowland, & Fitzgibbons, 1994; 
Varma, DeNisi, & Peters, 1996). Other forms of 
rater bias are discussed by Goldstein (1991) and 
Smither (1994). 


Criterion Contamination 


Criterion contamination is said to exist when a 
criterion measure includes factors that are not 
demonstrably part of the job (Borman, 1991; Har- 
vey, 1991). For example, if a performance measure 
includes appearance, this would most likely be a 
case of criterion contamination, unless 
appearance is relevant to job success. Likewise, 
evaluating an employee on “dealing with the public” is 
only appropriate if the job actually requires the em- 
ployee to meet the public. Goldstein (1992) out- 
lines three kinds of criterion contamination: 


1. Opportunity bias occurs when workers have dif- 
ferent opportunities for success, as when one 
salesperson is assigned to a wealthy neighbor- 
hood and others must seek sales in isolated, rural 
areas. 

2. Group characteristic bias is present when the 
characteristics of the group affect individual per- 
formance, as when workers in the same unit 
agree to limit their productivity to maintain pos- 
itive social relations. 

3. Knowledge of predictor bias occurs when a su- 
pervisor permits personal knowledge about an 
employee to bias the appraisal, as when quality 
of the college attended by a new worker affects 
her evaluation. 


Careful attention to job analysis as a basis for 
selection of appraisal criteria is the best way to re- 
duce errors in performance appraisal. In addition, 
employers should follow certain guidelines in per- 
formance appraisal, as discussed in the following 
section. 


Guidelines for Performance Appraisal 


Performance appraisal is a formidable task. Not 
only must employers pay attention to the psycho- 
metric soundness of their approach, they must 
also design a practical system that meets organi- 
zational goals. For example, appraisal standards 
must be sufficiently difficult and detailed to ensure 
that organizational goals are accomplished. An- 
other concern is that performance appraisal falls 
under the purview of Title VII of the Civil Rights 
Act of 1964. Hence, employers must develop fair 
systems that do not discriminate on the basis of 
race, sex, and other protected categories. To 
complicate matters, these standards (soundness, 
practicality, legality) may conflict with one another. 
The practical approach may be neither 
psychometrically sound nor legal. Often, appraisal 
methods that show the best measurement characteristics 
(e.g., strong interrater reliability) will fail to assess 
the most important aspects of performance; that 
is, they are not practical. This is a familiar refrain 
within the measurement field. Too often, psychol- 
ogists must choose between rigor and relevance, 
rarely achieving both at the same time. Finally, 
legal considerations must be taken into account when 
exploring the limits of performance appraisal. 

Smither (1998) has published guidelines for 
developing performance appraisal systems that we 
paraphrase here: 


• Base the performance appraisal upon a careful 
job analysis 

• Develop specific, contamination-free criteria for 
appraisal from the job analysis 

• Determine that the instrument used to rate 
performance is appropriate for the appraisal situation 

• Train raters to be accurate, fair, and legal in their 
use of the appraisal instrument 

• Use performance evaluations at regular intervals 
of six months to a year 

• Evaluate the performance appraisal system 
periodically to determine whether it is actually 
improving performance 


The training of raters is an especially important 
guideline. An appraisal system that seems perfectly 
straightforward to the employer could easily be 
misunderstood by an untrained rater, resulting in 
biased evaluations. Borman (1991) notes that two 
kinds of rater training are effective: rater error 
training, in which the trainer seeks simply to alert raters 
to specific kinds of errors (e.g., halo effect); and 
frame-of-reference training, in which the trainer fa- 
miliarizes the raters with the specific content of 
each performance dimension. Research indicates 
that these kinds of training improve the accuracy of 
ratings. 


LEGAL ISSUES IN I/O ASSESSMENT 


Nearly every aspect of the employment relation- 
ship is subject to the law: recruitment, screening, 
selection, placement, compensation, promotion, 
and performance appraisal all fall within the 
domain of legal interpretations (Cascio, 1987). 


However, courts and legislative bodies have re- 
served special scrutiny for employment-related 
testing. The practitioner who refuses to learn rele- 
vant legal guidelines in personnel testing does 
so at great peril, because unwise practices can 
lead to costly and time-consuming litigation (Case 
Exhibit 11.2). 

Personnel testing is particularly sensitive be- 
cause the consequences of an adverse decision are 
often grave: The applicant does not get the job, or 
an employee does not get the desired promotion or 
placement. Recognizing that employment testing 
performs a sensitive function as gatekeeper to eco- 
nomic advantage, Congress has passed laws 
sharply regulating the use of testing. The courts 
have also rendered decisions that help define unfair 
test discrimination. In addition, regulatory bodies 
have published guidelines that substantially impact 
testing practices. We will provide a current per- 
spective on the regulation of personnel testing by 
tracing the development of laws, regulations, and 
major court cases. 

It may surprise the reader to learn that em- 
ployment testing has raised legal controversy only 
in the last 35 years (Arvey & Faley, 1988). During 
this period, several definitive court decisions and 
path-breaking governmental directives have helped 
define current legal trends. These landmarks are 
depicted in Table 11.11, beginning with the Civil 
Rights Act of 1964, proceeding through the federal 
regulations of the Equal Employment Opportunity 
Commission (EEOC), and concluding with very 
recent court cases and legislative developments. 
We will review these landmarks in chronological 
order. 


Early Court Cases and Legislation 


During the presidency of Lyndon Johnson, Con- 
gress passed the Civil Rights Act of 1964. This early 
civil rights legislation had a profound effect on 
employee-testing procedures. In addition to broad 
provisions designed to prevent discrimination in 
many social contexts, Title VII of this act prohibits 
employment practices that discriminate on the basis 
of race, color, religion, sex, or national origin. The 


TABLE 11.11 Major Legal Landmarks in Employment Testing 

1964  Myart v. Motorola. This case set the precedent for courts to hear 
      employment testing cases. 

1964  Civil Rights Act. This act prohibits job discrimination based on sex, 
      race, color, religion, or national origin. 

1966  EEOC Guidelines. The first published guidelines on employment 
      testing practices. 

1971  Griggs v. Duke Power Company. The Supreme Court rules that 
      employment test results must have a demonstrable link to job 
      performance. 

1973  United States v. Georgia Power Company. Ruling strengthens the 
      authority of EEOC guidelines for studies of employment testing 
      validity. 

1975  Albemarle v. Moody. EEOC guidelines strengthened; subjective 
      supervisory ratings ruled a poor basis for validating tests. 

1976  Washington v. Davis. Court ruled that performance in a training 
      program was a sufficient basis against which to validate a test. 

1978  Uniform Guidelines on Employee Selection. These guidelines defined 
      adverse impact by the four-fifths rule and incorporated criteria for 
      validity in employee selection studies. 

1988  Watson v. Fort Worth Bank and Trust. The court ruled that subjective 
      employment devices such as the interview can be validated; 
      employees can claim disparate impact based on interview-based 
      promotion policies. 

1990  Americans with Disabilities Act. This act sharply limits the reasons 
      for not hiring a disabled person. One provision is that medical tests 
      may not be administered prior to an offer of employment. 

1991  Civil Rights Act. This act outlaws subgroup norming of employee 
      selection tests. 


act established several important general principles 
relevant to employment testing (Cascio, 1987): 


• Discriminatory preference for any group, minority 
or majority, is barred by the act. 

• The employer bears the burden of proof that all 
requirements for employment, including test 
scores, are related to job performance. 

• Professionally developed tests used in personnel 
testing must be job related. 

• In addition to open and deliberate discrimination, 
the law forbids practices that are fair in form but 
discriminatory in operation. 

• Intent is irrelevant: the plaintiff need not show 
that discrimination was intentional. 

• In spite of these proscriptions, job-related tests 
and other measuring devices are deemed both 
legal and useful. 


The 1964 legislation also created the Equal Em- 
ployment Opportunity Commission (EEOC) to de- 
velop guidelines defining fair employee-selection 
procedures. The initial guidelines, published in 
1966, were vague. Later revisions of these guide- 
lines, including the Uniform Guidelines on Em- 
ployee Selection (1978), were quite specific and 
have been used by the courts to help resolve legal 
disputes regarding employment-testing practices 
(see the following section). 

The 1964 Myart v. Motorola case marked the 
first involvement of the courts in employment 
testing. The issues raised by this landmark case are 
still reverberating today. Leon Myart was an African 
American applicant for a job at one of Motorola’s 
television assembly plants. Even though he had 
highly relevant job experience, Mr. Myart was re- 
fused a position because his score on a brief screen- 
ing test of intelligence fell below the company 
cutoff. Claiming racial discrimination, he filed an 
appeal with the Illinois Fair Employment Practices 
Commission. The state examiner found in favor of 
the complainant and directed that the Motorola 
company should offer Mr. Myart a job. In addition, 
the examiner ruled that the particular test should not 
be used in the future and that any new test should 
“take into account the environmental factors which 
contribute to cultural deprivation.” In essence, the 


examiner concluded that Motorola’s employment- 
testing practices were unfair because they acted as 
a barrier to the employment of culturally deprived 
and disadvantaged applicants. Even though the case 
was later overturned for lack of evidence, Myart v. 
Motorola did set the precedent to hear such com- 
plaints in the court system (Arvey & Faley, 1988). 


Advent of EEOC Employment 
Testing Standards 


During the 1970s, several court cases helped shape 
current standards and practices in employment test- 
ing. The focus of Griggs v. Duke Power Company 
(1971) was the use of tests—in this case the Won- 
derlic Personnel Test and the Bennett Mechanical 
Comprehension Test—as eligibility criteria for em- 
ployees who wanted to transfer to other depart- 
ments. In particular, employees at Duke Power 
Company who lacked a high school education 
could qualify for transfer if they scored above the 
national median on both tests. This policy appeared 
to discriminate against African American employ- 
ees since it was disproportionately difficult for 
them to gain eligibility for transfer. However, lower 
courts found no discriminatory intent and therefore 
found in favor of the power company. 

In 1971, the Supreme Court reversed the lower 
court findings, ruling against the use of tests without 
their validation. The decision emphasized several 
points of current relevance (Arvey & Faley, 1988): 


• Fairness in employment testing is determined by 
consequences, not motivations. 

• Testing practices must have a demonstrable link 
to job performance. 

• The employer has the burden of showing that 
an employment practice such as testing is job 
related. 

• Diplomas, degrees, or broad testing devices are 
not adequate as measures of job-related capability. 

• The EEOC testing standards deserve considerable 
deference from employment testers. 


These employment testing guidelines were 
further refined in a 1973 court decision, United 
States v. Georgia Power Company. In this case, the 
Georgia Power Company presented a validation 
study to support its employment-testing practices 
when its policies were shown to have an adverse 
impact upon the hiring and transferring of African 
Americans. However, the validation study was 
weak, in part because it was based upon multiple 
discriminant analysis, a complex statistical tech- 
nique rarely used for this purpose. The courts ruled 
that the validation study was inadequate since it did 
not adhere to EEOC guidelines for evaluating 
validity studies. This finding ensconced the EEOC 
guidelines as virtually the law of the land in 
employment-testing practices. 

Several other court cases in the 1970s and 
1980s also served to strengthen the authority of 
EEOC testing guidelines. These cases were quite 
complex and involved multiple issues in addition to 
those cited here. In Albemarle v. Moody (1975), the 
Supreme Court deferred to EEOC guidelines in 
finding that subjective supervisory ratings are am- 
biguous and therefore constitute a poor basis for 
evaluating the validity of an employment selection 
test. The central issue in Washington v. Davis 
(1976) was whether performance in a training pro- 
gram (as opposed to actual on-the-job perfor- 
mance) was a sufficient basis for determining the 
job-relatedness of the employment selection pro- 
cedures. In this case, the Supreme Court ruled that 
performance in a police officer training program 
was a sufficient criterion against which to validate 
a selection test. 

In State of Connecticut v. Teal, the U.S. 
Supreme Court sided with four African American 
state employees who had failed a written test that 
was used to screen applicants for the position of 
welfare eligibility supervisor. The workers claimed 
unfair discrimination, noting that only 54 percent of 
minority applicants passed, compared to 80 percent 
for whites. In its defense, the state of Connecticut 
argued that discrimination did not exist, since 23 
percent of the successful African American appli- 
cants were ultimately promoted, compared to 14 
percent for whites. The Court was not impressed 
with this argument, noting that Title VII of the 1964 
Civil Rights Act was specifically designed to 
protect individuals, not groups. Thus, any unfairness 
to an individual is unacceptable. Further analysis of 
fair employment court cases can be found in Arvey 
and Faley (1988), Cascio (1987), Kleiman and 
Faley (1985), and Russell (1984). 


Uniform Guidelines on Employee Selection 


During the 1970s, several federal agencies and pro- 
fessional groups proposed revisions and extensions 
of the existing EEOC employment testing guide- 
lines. The revisions were developed in response to 
court decisions which had interpreted EEOC guide- 
lines in a narrow, inflexible, legalistic manner. 
However, the existence of several sets of compet- 
ing guidelines was confusing, and strong pressures 
were exerted upon the involved parties to forge a 
compromise. These efforts culminated in a consen- 
sus document known as the 1978 Uniform Guide- 
lines on Employee Selection. 

The Uniform Guidelines quickly earned respect 
in court cases and were frequently cited in the res- 
olution of legal disputes. The new guidelines con- 
tain interpretation and guidance not found in earlier 
versions, particularly regarding adverse impact, 
fairness, and the validation of selection procedures, 
as discussed later. 

The Uniform Guidelines provide a very specific 
definition of adverse impact. In general, when se- 
lection procedures favor applicants from one group 
(usually males or whites), the basis for selection is 
said to have an adverse impact on other groups 
(usually females or nonwhites) with a lower selec- 
tion proportion. The Uniform Guidelines define ad- 
verse impact with a four-fifths rule. Specifically, 
adverse impact exists if one group has a selec- 
tion rate less than four-fifths of the rate of the 
group with the highest selection rate. For example, 
consider an employer who has 200 applicants in a 
year, 100 African American and 100 white. If 120 
persons were hired, including 80 whites and 40 
African Americans, then the percentage of whites 
hired is 80 percent (80/100), whereas the percent- 
age of African Americans hired is 40 percent 
(40/100). Since the selection rate for African Amer- 
icans is only half that of whites (40 percent/80 per- 
cent), the employer might be vulnerable to charges 
of adverse impact. We should note that the Uniform
Guidelines suggest caution about this rule when 
sample sizes are small. 
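
The arithmetic behind the four-fifths rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 0.8 threshold is the four-fifths figure given above, and the sketch ignores the caution about small samples.

```python
def selection_rate(hired, applicants):
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return hired / applicants


def adverse_impact_ratios(rates):
    """Divide each group's selection rate by the highest group rate."""
    top = max(rates.values())
    return {group: rate / top for group, rate in rates.items()}


def flag_adverse_impact(rates, threshold=0.8):
    """List the groups whose ratio falls below the four-fifths threshold."""
    ratios = adverse_impact_ratios(rates)
    return [group for group, ratio in ratios.items() if ratio < threshold]


# The hiring example from the text: 80 of 100 whites hired,
# 40 of 100 African Americans hired.
rates = {
    "white": selection_rate(80, 100),
    "African American": selection_rate(40, 100),
}
print(adverse_impact_ratios(rates))  # ratio for African Americans is 0.5
print(flag_adverse_impact(rates))    # ['African American']
```

The same functions can be applied to the pass rates reported in State of Connecticut v. Teal (54 percent versus 80 percent), which yield a ratio of about .68, also below four-fifths.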

The Uniform Guidelines also pay more atten- 
tion to fairness than previous documents. Fairness 
is treated in the following manner: 


When members of one racial, ethnic, or sex group 
characteristically obtain lower scores on a selection 
procedure than members of another group, and the 
differences are not reflected in differences in a mea- 
sure of job performance, use of the selection proce- 
dure may unfairly deny opportunities to members of 
the group that obtain the lower scores. Furthermore, 
in cases where two or more selection procedures are 
equally valid, the employer is obliged to use the 
method that produces the least adverse impact. 


The Uniform Guidelines also establish a strong 
affirmative action responsibility on the part of em- 
ployers. If an employer finds a substantial dispar- 
ity in persons hired from a subgroup compared to 
their availability in the job market, several correc- 
tive steps are recommended. These corrective mea- 
sures include specialized recruitment programs 
designed to attract qualified members of the group 
in question, on-the-job training programs so that af- 
fected minorities do not get locked into dead-end 
jobs, and a revamping of selection procedures to re- 
duce or eliminate exclusionary effects. 

Finally, the guidelines provide specific technical
standards for evaluating validity studies of employee
selection procedures. The courts will almost
certainly consult these Uniform Guidelines if employees
bring suit against the company for alleged
unfairness in employee selection practices. Thus, it 
is a foolish employer who does not pay special at- 
tention to these technical criteria. For example, one 
criterion concerns the use of performance scores 
obtained during training programs: 


Where performance in training is used as a crite- 
rion, success in training should be properly mea- 
sured and the relevance of the training should be 
shown either through a comparison of the content 
of the training program with the critical or impor- 
tant work behavior(s) of the job(s), or through a 
demonstration of the relationship between mea- 
sures of performance in training and measures of 
job performance. 


Thus, preemployment evaluation of job candidates 
in a training program may constitute a valid method 
of employee selection, but only if a strong link ex- 
ists between the task demands of training and the 
requirements of the actual job. 

The Uniform Guidelines contain many other 
criteria that we cannot review here. We urge the 
reader to read this fascinating and influential doc- 
ument which is often cited in court cases on em- 
ployment discrimination. 


Legal Implications of Subjective 
Employment Devices 


In many corporations, promotions are based upon 
the subjective judgment of senior managers. A 
common practice is for one or more managers to 
interview several qualified employees and offer a 
promotion to the one candidate who appears most 
promising. The selection of this candidate is typi- 
cally based upon subjective appraisal of such fac- 
tors as judgment, originality, ambition, loyalty, and 
tact. Until recently, these subjective employment 
devices appeared to be outside the scope of fair em- 
ployment practices codified in the Uniform Guide- 
lines and other sources. 

However, in a recent civil rights case, Watson v. 
Fort Worth Bank and Trust (1988), the Supreme 
Court made it easier for employees to prove 
charges of race or sex discrimination against em- 
ployers who use interview and other subjective 
assessment devices for employee selection or pro- 
motion. We outline the factual background of this 
important case before discussing the legal implica- 
tions (Bersoff, 1988). 

Clara Watson, an African American employee 
at Fort Worth Bank and Trust, was rejected for pro- 
motion to supervisory positions four times in a row. 
Each time, a white applicant received the promo- 
tion. Watson obtained evidence showing that the 
bank had never had an African American officer or 
director, had only one African American super- 
visor, and paid African American employees lower 
salaries than equivalent white employees. Fur- 
thermore, all supervisors had to receive approval 
from a white male senior vice president for their 
promotion decisions. The bank did not dispute that
it made hiring and promotion decisions solely on 
the basis of subjective judgment. When an analysis 
of promotion patterns confirmed statistically sig- 
nificant racial disparities, Watson brought suit 
against the bank. 

Two legal theories were available for Watson to 
litigate her claim under Title VII of the 1964 Civil 
Rights Act. The two theories are called “disparate 
treatment” and “disparate impact.” A disparate 
treatment case is more difficult to litigate, since 
the plaintiff must prove that the employer engaged 
in intentional discrimination. In a disparate impact 
case, intention is irrelevant. Instead, the plaintiff 
need merely show that a particular employment 
practice—such as using a standardized test—results 
in an unnecessary and disproportionately adverse 
impact upon a protected minority. 

The lower courts ruled that Watson was re- 
stricted to the more limited disparate treatment ap- 
proach since the employer had used subjective 
evaluation procedures. Furthermore, the lower 
courts ruled that the bank had not engaged in in- 
tentional discrimination and did have legitimate 
reasons for not promoting Watson. Nonetheless, the 
Supreme Court agreed to hear the case in order to 
determine whether a disparate impact analysis 
could be applied to subjective employment devices 
such as interview. Relying heavily upon a brief 
from the American Psychological Association 
(APA, 1988), the Supreme Court ruled unani- 
mously that the disparate impact analysis is applic- 
able to subjective or discretionary promotion 
practices based on interview. In effect, the court 
ruled that subjective employment devices such as 
interview can be validated. Thus, employers do not 
have unmonitored discretion to evaluate applica- 
tions for promotion based on subjective interview. 
As a consequence of Watson v. Fort Worth Bank 
and Trust, employers must be ready to defend all 
their promotion practices—including subjective 
interview—against claims of adverse impact. 


Recent Developments in Employee Selection 


In 1990, Congress passed the Americans with Disabilities
Act (ADA), which forbids discrimination
against qualified individuals with disabilities. The
ADA was discussed briefly in Topic 7A, Testing 
Special Populations. This act protects job appli- 
cants with disabilities by sharply limiting permis- 
sible reasons for refusing to hire them. Specifically, 
employers can decline to hire a disabled worker for 
only the following reasons: (1) hiring the applicant
would cause the company undue hardship in
terms of making accommodations for the disability;
(2) business necessity; or (3) the presence of
the disabled worker would pose a direct threat to
the health or safety of the worker or others.

An important stipulation of ADA is that med- 
ical tests may not be administered prior to an offer 
of employment. Unfortunately, what constitutes a 
“medical test” is not well defined by the act. In par- 
ticular, it is possible that intelligence tests might be 
construed as “medical” in nature, which could 
wreak havoc with employment testing: 


According to ADA requirements, if an attribute is 
not required for performing an essential task, then 
an applicant may request an accommodation or 
modification of either the testing process or the job 
if he or she claims a covered disability that is asso- 
ciated with that nonessential attribute. In practice, 
this might mean that unless it is demonstrated that 
intelligence is required for accomplishing an essen- 
tial task, no test that measures intelligence (or any 
facet of intelligence) could be administered before 
offering employment to any applicant claiming an 
impairment that is associated with intellectual 
functioning. (Landy, Shankster, & Kohler, 1994) 


It is still too soon to determine the impact of 
ADA upon the practice of personnel selection. 
There is confusion about which disabilities are cov- 
ered by the act, uncertainty about which selection 
practices are forbidden, and anxiety over how many 
people will seek accommodation under the act 
(Klimoski & Palmer, 1994). Court decisions and 
administrative guidelines will be needed to sharpen 
the focus of this important legislation. 

The Civil Rights Act of 1991 also contained im- 
portant provisions relevant to employee selection 
and appraisal. Specifically, the act outlaws sub- 
group norming of test scores, which effectively 
eliminates the use of separate hiring and promotion 
lists. Subgroup norming refers to the practice of 
using identified subgroups (instead of a diversified 
national sample) for purposes of developing group-specific
test norms. The prohibition of this practice
presents a challenge to employers and I/O psy- 
chologists, since racial subgroup norming of test 
scores has been a popular and effective method for 
avoiding adverse impact. 

Recent court cases also have affected personnel
testing. The issue in Soroka v. Dayton Hudson
was whether corporations can use a personality test
as a basis for preemployment screening for mental
health problems in job applicants. As discussed previously,
Soroka was required to take the Rodgers
Psychscreen as part of the application process for a
position as security guard. The Psychscreen is a
true-false personality inventory intended to identify
persons with psychological problems such as depression
and anxiety. Soroka filed suit against the
department store, claiming that individual questions
about his sexual practices and religious beliefs were
a violation of his civil rights. This case was interesting
because it pertained to the value and validity
of individual items as opposed to overall test scores.
The courts have long held that preemployment testing
must have demonstrated relevance to job performance
or it cannot be used. However, the courts
have not required validity evidence for individual
test items. Soroka won his case, which was appealed
by Dayton Hudson. In 1993, the company settled
out of court. This litigation is summarized in Case
Exhibit 11.2 found earlier in this section.


SUMMARY 


1. Performance appraisal of employees is es- 
sential to the ongoing success of any business or or- 
ganization. Applied psychological measurement is 
at the heart of performance appraisal. 


2. Performance evaluation serves many orga- 
nizational purposes, including promotions, transfers, 
layoffs, and the setting of salaries. Although objec- 
tive methods for assessing the effectiveness of em- 
ployees would appear to be preferable, judgmental 
approaches are often the only practical choice. 


3. Methods for performance appraisal include 
performance measures such as productivity counts; 
personnel data such as rate of absenteeism; peer 
ratings and self-assessments; and supervisor eval- 
uations such as rating scales. Rating scales are by 
far the most common approach. 


4. About three-quarters of all performance 
evaluations are based upon judgmental methods 
such as supervisor rating scales. The simplest rat- 
ing scale is the graphic rating scale, which consists 
of trait labels, brief definitions of those labels, and 
a continuum for the rating. 


5. The behaviorally anchored rating scale 
(BARS) is a popular form of criterion-referenced 
performance measure. A BARS form contains ex- 
plicit behavioral anchors along a continuum of ex- 
cellence that the supervisor evaluates in terms of 
past observations of work performance. 


6. Performance appraisal is subject to several 
sources of error, including failure to identify ap- 
propriate criteria for acceptable and unacceptable 
performance, halo effect (rating an employee high 
or low on all dimensions because of a global im- 
pression), rater bias, and criterion contamination. 


7. Criterion contamination occurs when a cri- 
terion measure includes factors that are not demon- 
strably part of the job, such as rating an employee 
on “dealing with the public” when this is not really 
part of the position. 


8. Appropriate guidelines for the develop- 
ment of performance appraisal systems include 
basing the appraisal method upon a careful job 
analysis; training raters to be fair, accurate, and 
legal; and evaluating the performance appraisal sys- 
tem periodically. 

9. Employee testing and appraisal is carefully
circumscribed by legal and regulatory guidelines. 
For example, Title VII of the Civil Rights Act of 
1964 prohibits employment practices that discrim- 
inate on the basis of race, color, religion, sex, or na- 
tional origin. 

10. Several court cases have helped to shape 
testing practices in personnel selection. For exam- 
ple, in Griggs v. Duke Power (1971) the Supreme 
Court ruled that fairness in employment testing is 
determined by consequences, not motivations; 
testing practices must have a demonstrable link to
job performance; and the employer must show that 
a testing practice is job related. 


11. Several federal agencies and professional 
groups helped develop the Uniform Guidelines on 
Employee Selection (1978). This document pro- 
vides guidance on many employee-testing prac- 
tices, including a very specific definition of adverse 
impact. 

12. In general, when selection procedures 
favor applicants from one group (usually males or 
whites), the basis for selection is said to have an ad- 
verse impact on other groups (usually females or 
nonwhites) when they have a lower selection proportion
(less than four-fifths of the rate for the group with
the highest selection rate).


13. As a consequence of Watson v. Fort Worth 
Bank and Trust (1988), employers must now be 
ready to defend all their promotion practices— 
including subjective interview—against claims of 
adverse impact. 


14. The Americans with Disabilities Act and 
the Civil Rights Act of 1991 also contained impor- 
tant provisions relevant to employee selection and 
appraisal. For example, the Civil Rights Act out- 
laws subgroup norming of tests. 


KEY TERMS AND CONCEPTS 


criterion problem
graphic rating scale
critical incidents checklist
behaviorally anchored rating scale
behavior observation scale
forced-choice scale
halo effect
rater bias
criterion contamination
adverse impact


CHAPTER 12

Attitudes, Interests, and
Values Assessment


TOPIC 12A

Interests and Values in Vocational Assessment


The Assessment of Life Values 

An Overview of Interest Assessment 
Inventories for Interest Assessment 
Career and Work Values Assessment 
Integrative Model of Career Assessment 


Summary 
Key Terms and Concepts 


In this chapter we examine approaches to the
assessment of attitudes, interests, and values, 
broadly defined. Because they are formative in 
everything from work to worship, attitudes, inter- 
ests, and values are fundamental to the identity of 
each individual. It is no accident that the adolescent 
who values aesthetic harmony later reveals an in- 
terest in literature and then pursues a vocation as 
English teacher. Nor is it surprising when a shy 
teenager with an analytic bent shows a passion for 
mathematics and becomes a computer scientist. The 
values held by persons shape their interests in life, 
which, in turn, shape career choices. Lives possess 
a coherency that is explained, in part, by the influ- 
ence of interests and values. 

Values not only link the individual to the world
of work; they are intertwined in moral, spiritual,
and religious matters as well. Whether we favor or 
oppose capital punishment, whether we find life 
meaningful or merely chaotic, whether we seek or 
avoid religious practice—these matters we resolve 
based upon personal values. In sum, the choices we 
make in matters of work, spiritual life, and personal 
conduct are not random; they are bound together by
common threads that we call interests and values. 

A problem faced by many young adults is that 
their values are unstated and their interests are un- 
explored. Furthermore, they lack knowledge about 
career options. In these cases, career selection can 
arouse anxiety, and perhaps it should. Lowman 
(1991) has noted that the process of finding a vocation
can be as complex and as difficult as choosing
a mate. The dilemma of career choice is not
limited to young adults entering the job market, but 
also vexes older workers who are dissatisfied with 
their careers. Fortunately, a large array of tests and 
guidance approaches are available to help individ- 
uals identify values, interests, and potential career 
choices, as reviewed in this topic. 

In Topic 12A, Interests and Values in Vocational 
Assessment, we survey the measurement of values 
and interests, especially as these concepts apply to 
vocational choice. We begin with a quick overview 
of historically relevant tests for the evaluation of 
general life values, and then turn to the application 
of specialized tests for career assessment and ad- 
vising. In Topic 12B, Attitudes and the Assessment 
of Moral and Spiritual Concepts, we introduce the 
reader to methods and concerns in the measurement 
of attitudes, and then present assessment ap- 
proaches pertinent to the moral, spiritual, and reli- 
gious dimensions of the individual. 


THE ASSESSMENT OF LIFE VALUES


In the popular media we find frequent reference to 
values and changes in values at the individual and 
national level. Politicians deplore the decline of 
family values, magazine editors denounce the ab- 
sence of altruistic volunteerism, and columnists 
disparage the reemergence of materialism and ca- 
reerism. Religious leaders enter the fray, too. As an 
antidote to global cynicism, they call for a return to 
spiritual values that affirm the meaning of life. 
Practically everyone has an opinion about values— 
especially in regard to the presumed values of other 
persons or groups. 

But what are values and how can they be mea- 
sured? Although a huge amount of literature exists 
on the nature and definition of values, there is sur- 
prisingly little empirical research on their mea- 
surement. In general, psychologists define a value 
as a shared, enduring belief about ideal modes of
behavior or end states of existence (Rokeach, 
1980). Values instill action, shape attitudes, and 
guide efforts to influence others. Values also arise
in response to societal conditions and are therefore
malleable to some degree (Ball-Rokeach, Rokeach, 
& Grube, 1984). 

In this topic, we examine key issues and im- 
portant tests that pertain to the assessment of per- 
sonal values, broadly defined. We begin with a 
critique of wideband instruments that assess life 
values—the social ends or goals considered desir- 
able of achievement. The chapter then reviews as- 
sessment approaches in the moral, spiritual, and 
religious domains. This includes lengthy coverage 
of Kohlberg’s (1981, 1984) classic method for the 
measurement of moral reasoning. We close with
brief coverage of the overlooked literature on the 
measurement of spiritual and religious concepts. 

Values are important because they provide a 
pervasive framework for personal actions and judg- 
ments. When we know the life values of an indi- 
vidual, we can predict typical behaviors and 
surmise likely attitudes. In a classic work on the 
topic, Rokeach (1968) underscores the importance 
of values: 


To say that a person “has a value” is to say that he 
has an enduring belief that a specific mode of con- 
duct or end-state of existence is personally and so- 
cially preferable to alternative modes of conduct or 
end-states of existence. Once a value is internalized 
it becomes, consciously or unconsciously, a stan- 
dard or criterion for guiding action, for developing 
and maintaining attitudes toward relevant objects 
and situations, for justifying one’s own and others’ 
actions and attitudes, for morally judging self and 
others, and for comparing self with others. Finally, 
a value is a standard employed to influence the val- 
ues, attitudes, and actions of at least some others— 
our children’s, for example. (pp. 159-160) 


This view that values are in some sense primary and 
formative also has been advanced by Kluckhohn 
(1951) and Smith (1963). 

Values are more easily defined than measured. 
Few value scales have withstood the test of time. 
We survey three instruments here: the Study of Val- 
ues is an interesting test mainly of historical im- 
portance; the Rokeach Value Survey is a highly 
respected research tool; the Values Inventory pro- 
vides a cautionary illustration that bad tests occa- 
sionally do make their way into publication. 


Study of Values


Psychologists have been interested in the assess- 
ment of personal values since early in the twentieth 
century. However, it is only in the last 30 years that 
psychometrically sound self-report measures of 
values have been developed. An early instrument in 
this vein was the Study of Values (SOV), an inven- 
tory designed to measure six basic evaluative atti- 
tudes: Theoretical (T), Economic (E), Aesthetic (A), 
Social (S), Political (P), and Religious (R) (Allport 
& Vernon, 1931; Allport, Vernon, & Lindzey, 1960). 
These six values were patterned directly after 
Spranger’s (1928) Types of Men. In this influential 
book, the German intellectual Eduard Spranger ar- 
gued that most people display one of the following 
as a dominant value that defines their personality: 


Theoretical (T): The dominant interest of the theoretical
person is the discovery of truth.

Economic (E): The economic person is primarily
interested in what is useful.

Aesthetic (A): The aesthetic person sees the highest
value in form and harmony.

Social (S): Love of people is the highest value for
the social person.

Political (P): The political person is interested
primarily in power.

Religious (R): The religious person places the
highest value upon mystical unity with the
cosmos.


The SOV scale consists of 30 questions which pit 
one value against another, and another 15 questions 
that require the rank ordering of values. Examples 
of the questions include the following: 


• When you visit a church are you more impressed
by a pervading sense of reverence and worship
or by the architectural features and stained glass?
[Religious versus Aesthetic]

• In your opinion, has general progress been advanced
more by the freeing of slaves, with the enhancement
of the value placed on individual life,
or by the discovery of the steam engine, with the
consequent industrialization and economic rivalry
of European and American countries?
[Social versus Economic]


From answers to the forced-choice questions and 
the rank ordering of values, a profile of values is 
plotted in ipsative manner, displaying the relative 
strength of the six values for each individual. 

Lubinski, Schmidt, and Benbow (1996) dem- 
onstrated the merit of testing values with the SOV 
in a 20-year follow-up study of 203 intellectually 
gifted adolescents. Their gifted sample was first 
tested at age 13 and then again as adults at age 33. 
In general, the six themes revealed significant sta- 
bility over this time period, with mean interindi- 
vidual correlations of .37 for the various themes. 
This is remarkable, given that the teenage and 
young adult years are assumed to be a period of tur- 
moil and change, especially in personal values, as 
young persons struggle to find an identity. Sex dif- 
ferences were notable: Males tended to shift toward 
a T-E-P profile as adults whereas females tended to 
shift toward an A-S-R profile. Even so, a common 
pattern was observed for all participants, with 
Aesthetic and Economic values taking on more 
saliency in young adulthood and Political and So- 
cial values revealing less dominance. 

The Study of Values has provoked considerable 
discussion as a classroom demonstration tool in 
psychology courses, but otherwise has not been an 
influential test. A major problem with the instru- 
ment is that the six values are vaguely defined and 
too general to be of practical use. Nonetheless, the 
test did inspire others to develop more sophisticated 
and comprehensive approaches to values assess- 
ment. One of those who acknowledged a debt to 
Allport and the Study of Values was Milton 
Rokeach. 


Rokeach Value Survey 


Rokeach (1973) defined two kinds of values, in- 
strumental and terminal. Instrumental values are 
desirable modes of conduct, whereas terminal val- 
ues are desirable end states of existence. For ex- 
ample, ambition is an instrumental value, whereas 
family security is a terminal value. In devising the 
Rokeach Value Survey, a final list of 18 instru- 
mental values was arrived at by condensing 555 
“personality-trait” names into near-synonyms. The
final list of 18 terminal values was derived from
literature survey and other subjective, impressionistic
approaches. The 36 values are listed in Table 12.1.


TABLE 12.1 The 36 Value Constructs from the
Rokeach Value Survey, Form D

Terminal Values

A Comfortable Life          Inner Harmony
An Exciting Life            Mature Love
A Sense of Accomplishment   National Security
A World at Peace            Pleasure
A World of Beauty           Salvation
Equality                    Self-Respect
Family Security             Social Recognition
Freedom                     True Friendship
Happiness                   Wisdom

Instrumental Values

Ambitious                   Imaginative
Broadminded                 Independent
Capable                     Intellectual
Cheerful                    Logical
Clean                       Loving
Courageous                  Obedient
Forgiving                   Polite
Helpful                     Responsible
Honest                      Self-Controlled


Although the individual values are not defined 
in detail, each is accompanied by a short phrase or 
synonyms to clarify the item for respondents. For 
example, the first of the terminal values reads as 
follows: “A COMFORTABLE LIFE (a prosperous 
life).” Completing the survey is extremely simple.
Respondents are asked to rank separately the 18 ter- 
minal and 18 instrumental values based on “their 
importance to you, as guiding principles in your 
life.” The values are printed on gummed labels (for 
Form D). Subjects merely peel off the labels and 
arrange them in order of importance, removing and 
reattaching as needed. The rank for each item be- 
comes the score for that value. Ties are not allowed, 
so value scores will range from 1 to 18, with lower 
scores indicating greater importance. 


Reliability of the Value Survey can be ap- 
proached in two ways. The first is the temporal sta- 
bility of rank orderings for individual subjects. For 
this approach, the scale is administered twice and 
the two sets of rank orderings are correlated for 
each individual. Using this approach with four 
groups of college students (retest intervals of three 
weeks to four months), Rokeach (1973) reported 
median test-retest correlations ranging from .76 to 
.80 for terminal values, and .65 to .72 for instru- 
mental values. The second way to examine relia- 
bility is to calculate the test-retest reliability of 
individual value scores separately, across all re- 
spondents. Using this approach, reliability of the 
individual scales is lower, about .65 for the termi- 
nal values and .56 for the instrumental values 
(Rokeach, 1973). These reliabilities are rather low 
in comparison to instruments with more items per 
scale—which is not surprising. After all, the 
“scales” on the Value Survey each consist of a sin- 
gle item. Nonetheless, with reliabilities this low, the 
Value Survey should be used only for research pur- 
poses such as description or comparison of group 
values. Individual interpretation for counseling 
purposes cannot be supported. 
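
The two approaches to reliability described above differ only in which set of numbers is correlated. The following sketch makes the distinction concrete. It rests on several assumptions: the data are invented for illustration (four respondents ranking five values rather than 18), the function and variable names are hypothetical, and a plain Pearson correlation is applied to the ranks, which for untied rankings is equivalent to a rank-order correlation. It is not a reconstruction of Rokeach's own computations.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)


# Invented rankings: each list holds one respondent's ranks for the
# same five values at the first and second administration.
time1 = {"r1": [1, 2, 3, 4, 5], "r2": [2, 1, 4, 3, 5],
         "r3": [5, 4, 3, 2, 1], "r4": [3, 1, 2, 5, 4]}
time2 = {"r1": [1, 3, 2, 4, 5], "r2": [2, 1, 3, 4, 5],
         "r3": [4, 5, 3, 2, 1], "r4": [3, 2, 1, 5, 4]}

# Approach 1: correlate each respondent's two rank orderings, then
# summarize the within-person correlations with the median.
within_person = [pearson(time1[r], time2[r]) for r in time1]
median_within = median(within_person)

# Approach 2: for each value separately, correlate its test and
# retest scores across all respondents.
n_values = len(time1["r1"])
per_value = [
    pearson([time1[r][j] for r in time1], [time2[r][j] for r in time1])
    for j in range(n_values)
]

print(round(median_within, 2))
print([round(r, 2) for r in per_value])
```

In the first approach each coefficient rests on all of the values a person ranked, whereas in the second each coefficient rests on a single item, which is consistent with the lower reliabilities reported for individual values.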

In an intriguing example of its application in re- 
search, Rokeach and his colleagues used the Value 
Survey to measure the effects of viewing a single 
30-minute television program on values, attitudes, 
and behaviors (Ball-Rokeach, Rokeach, & Grube, 
1984). The television program, hosted by Ed 
Asner and known as “The Great American Values 
Test,” was specially designed to influence viewers’ 
ratings of the importance of the terminal values 
of freedom and equality. For example, over a full- 
screen graphic display indicating that Americans 
had ranked freedom third and equality twelfth, 
on average, among 18 terminal values, Asner 
commented: 


Americans feel that freedom is very important. 
They rank it third. But they also feel that equality 
is considerably less important . . . they rank it 
twelfth. Since most Americans value freedom
far higher than they value equality, the question
is: what does that mean? Does it suggest that
Americans as a whole are much more interested
in their own freedom than they are in freedom for
other people? Is there a contradiction in the Ameri- 
can people between their love of freedom and their 
lesser love for equality? 

By comparing your values with these results, 
you should be able to decide for yourself whether 
you agree with the average American’s feelings 
about freedom and equality. (Ball-Rokeach et al., 
1984) 


A full discussion of this study would involve a 
lengthy detour away from the topic of psychologi- 
cal testing. However, the reader may appreciate a 
quick summary. The authors used a tightly con- 
trolled pretest-posttest design with experimental 
and control cities to determine the effects of view- 
ing the program. For viewers who watched the 
show without interruption, mean rankings on 
equality went from 11.0 to 9.3, whereas for non- 
viewers the ratings on this value were quite stable. 
A number of other experimental checks (e.g., so- 
liciting donations to provide cultural opportunities 
for African American children) also confirmed a 
real change in values. This study is a good exam- 
ple of the kind of social research for which the 
Value Survey is well suited. 


Limitations of the Rokeach Value Survey 


We have already mentioned that the individual 
scales of the Value Survey possess marginal relia- 
bility—which means that the instrument should 
not be used for individual guidance. Several addi- 
tional limitations stem from the ipsative nature of 
the test. The reader will recall that an ipsative test 
is one in which the average of the scales is always 
the same for every examinee. In particular, the 
average rank for the 18 instrumental values will 
always be 9.5, and likewise for the terminal values. 
By definition, when an examinee gives some 
scales a high ranking, others must receive a low 
ranking. What is lost in this process is any absolute 
measure of the value for that individual. Suppose, 
for example, that we could measure the absolute 
strength of the 18 instrumental values on a scale 
from 1 to 100 (note: this is not possible with the 
Value Survey). Consider the case in which
individual A has an absolute strength of 99 for
ambitious and 98 for obedient with all other values
below 90, whereas individual B has an absolute 
strength of 39 for ambitious and 19 for obedient 
with all other values below 10. Most likely, indi- 
vidual A would value ambition and obedience to a 
high degree, whereas individual B modestly val- 
ues ambition and devalues obedience. In fact, in- 
dividual B could be characterized as almost 
valueless. Yet, both persons would receive scores 
of 1 for ambitious and 2 for obedient. The Value 
Survey is not sensitive to magnitude differences 
within individual subjects, nor does it capture scal- 
ing differences between individuals. 
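
The loss of absolute magnitude under ranking can be made concrete with a minimal Python sketch. The strengths assigned to the 16 unnamed values and the helper names (build_profile, rank_values) are hypothetical and serve only to mirror the example of individuals A and B; the Value Survey itself yields ranks, not absolute strengths.

```python
def build_profile(ambitious, obedient, ceiling):
    """Hypothetical absolute strengths for 18 values (illustration only)."""
    profile = {"ambitious": ambitious, "obedient": obedient}
    # Sixteen placeholder values, all distinct and all below `ceiling`.
    for k in range(1, 17):
        profile[f"other_{k}"] = ceiling - 0.5 * k
    return profile


def rank_values(profile):
    """Convert absolute strengths to ranks, where 1 is the most valued."""
    ordered = sorted(profile, key=profile.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


person_a = build_profile(ambitious=99, obedient=98, ceiling=90)
person_b = build_profile(ambitious=39, obedient=19, ceiling=10)

ranks_a = rank_values(person_a)
ranks_b = rank_values(person_b)

# The mean rank is fixed by the ranking format, not by the examinee.
mean_rank_a = sum(ranks_a.values()) / len(ranks_a)
mean_rank_b = sum(ranks_b.values()) / len(ranks_b)

print(ranks_a["ambitious"], ranks_a["obedient"])
print(ranks_b["ambitious"], ranks_b["obedient"])
print(mean_rank_a, mean_rank_b)
```

Both hypothetical examinees receive identical rank profiles even though their absolute strengths differ greatly, and the mean rank is 9.5 in each case.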

Braithwaite and Law (1985) call attention to 
additional weaknesses of the Value Survey. They 
note that the inventory omits several important 
values, including physical well-being, individual 
rights, thriftiness, and carefreeness. Perhaps more 
significant, they criticize the Rokeach test for rely- 
ing upon a single item for each value instead of 
using multi-item indices for the value constructs. 
They propose an alternative instrument (based on 
the Rokeach approach) that would presumably em- 
body improved psychometric qualities in the mea- 
surement of personal values. 


INTEREST ASSESSMENT 


In most applications of psychological testing, the 
goals of assessment are reasonably clear. For ex- 
ample, intelligence testing helps predict school per- 
formance; aptitude testing foretells potential for 
accomplishment; and personality testing provides 
information about social and emotional function- 
ing. But what is the purpose of interest assessment? 
Why would a psychologist recommend it? What 
can a client expect to gain from a survey of his or 
her interests? 

Interest assessment promotes two compatible 
goals: life satisfaction and vocational productivity. 
It is nearly self-evident that a good fit between in- 
dividual interests and chosen vocation will help 
foster personal life satisfaction. After all, when 
work is interesting we are more likely to experience
personal fulfillment as well. In addition, persons 
who are satisfied with their work are more likely to 
be productive. Thus, employees and employers 
both stand to gain from the artful application of
interest assessment. Several useful instruments exist
for this purpose, and we will review the most 
widely used interest inventories later. 

In the selection of employees, the consideration 
of personal interests may be of great practical sig- 
nificance to employers and therefore circumstan- 
tially relevant to the job candidates as well. We may 
sketch out a rough equation as follows:
productivity = ability × interest. In other words, high ability
in a specific field does not guarantee success; nei- 
ther does high interest level. The best predictions 
are possible when both variables are considered to- 
gether. Thus, employers have good reason to deter- 
mine whether a potential employee is well matched 
to the position; the employee should like to know 
as well. 

We begin with a critical examination of major 
interest tests. The six instruments chosen for review 
include the following: 


The Strong Interest Inventory (SII), the latest
revision of the well-known Strong Vocational
Interest Blank (SVIB)

The Jackson Vocational Interest Survey (JVIS), 
a test that embodies modern methods for scale 
construction 

The Kuder General Interest Survey (KGIS), an 
instrument that incorporates a divergent philoso- 
phy of test construction 

The Vocational Preference Inventory (VPI), which 
measures six widely used vocational themes 

The Self-Directed Search (SDS), a self-administered
and self-scored guide to exploring career
options

The Campbell Interest and Skill Survey (CISS), 
a recent and appealing test that is simple in for- 
mat but sophisticated in execution 


The review of prominent interest tests is followed 
by the related topic of assessment in career and 
work values. 


INVENTORIES FOR INTEREST ASSESSMENT


Strong Interest Inventory (SII)


The Strong Interest Inventory (SII) is the latest
revision of the Strong Vocational Interest Blank (SVIB),
one of the oldest and most prominent instruments in 
psychological testing (Strong, Hansen, & Camp- 
bell, 1994). We can best understand the SII by study- 
ing the history of its esteemed predecessor, the 
SVIB. In particular, we need to review the guiding 
assumptions used in the construction of the SVIB 
that have been carried over into the SII. 

The first edition of the SVIB appeared in 1927, 
eight years after E. K. Strong formulated the 
essential procedures for measuring occupational in- 
terests while attending a seminar at the Carnegie 
Institute of Technology (Campbell, 1971; Strong, 
1927). In constructing the SVIB, Strong employed 
two little-used techniques in measurement. First, 
the examinee was asked to express liking or dislik- 
ing for a large and varied sample of occupations, 
educational disciplines, personality types, and 
recreational activities. Second, the responses were 
empirically keyed for specific occupations. In an 
empirical key, a specific response (e.g., liking to 
roller skate) is assigned to the scale for a particular
occupation only if successful persons in that
occupation tend to answer in that manner more often
than comparison subjects. 

Although Strong did not express his underlying 
assumptions in a simple and straightforward man- 
ner, it is clear that the theoretical foundation for the 
SVIB derives from a typological, trait-oriented 
conception of personality. Tzeng (1987) has iden- 
tified the following basic assumptions in the devel- 
opment and application of the SVIB: 


1. Each occupation has a desirable pattern of in- 
terests and personality characteristics among its 
workers. The ideal pattern is represented by suc- 
cessful people in that occupation. 

2. Each individual has relatively stable interests 
and personality traits. When such interests and 
traits match the desirable interest patterns of the 
occupation the individual has a high probability
to enter that occupation and be more likely to
succeed in it. 

3. It is highly possible to differentiate individuals 
in a given occupation from others-in-general in 
terms of the desirable patterns of interests and 
traits for that occupation. 


Strong constructed the scales of his inventory 
by contrasting the responses of several specific oc- 
cupational criterion groups with those of a people- 
in-general group. The subjects for each criterion 
group were workers in that occupation who were 
satisfied with their jobs and who had been so em- 
ployed for at least three years. The items that dif- 
ferentiated the two groups, keyed in the appropriate 
direction, were selected for each occupational 
scale. For example, if members of a specific occu- 
pational group disliked “buying merchandise for a 
store” more often than people-in-general, then that 
item (keyed in the dislike direction) was added to 
the scale for that occupation. 
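
The logic of empirical keying can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Strong's actual procedure: the response rates, the 0.15 cutoff (min_gap), the unit weights, and the function names are all hypothetical.

```python
def build_key(criterion, general, min_gap=0.15):
    """Return {item: keyed response} for items that separate the groups.

    `criterion` and `general` map each item to the proportion of that
    group answering "like" and "dislike". An item joins the scale only
    if the criterion group gives one response more often than
    people-in-general by at least `min_gap` (an assumed cutoff).
    """
    key = {}
    for item, rates in criterion.items():
        gaps = {r: rates[r] - general[item][r] for r in ("like", "dislike")}
        best = max(gaps, key=gaps.get)
        if gaps[best] >= min_gap:
            key[item] = best
    return key


def score(responses, key):
    """Count the examinee's answers that match the keyed direction."""
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# Hypothetical response rates for one occupational criterion group.
criterion = {
    "buying merchandise for a store": {"like": 0.20, "dislike": 0.55},
    "roller skating": {"like": 0.60, "dislike": 0.15},
    "chemistry": {"like": 0.40, "dislike": 0.30},
}
general = {
    "buying merchandise for a store": {"like": 0.35, "dislike": 0.30},
    "roller skating": {"like": 0.40, "dislike": 0.25},
    "chemistry": {"like": 0.38, "dislike": 0.32},
}

key = build_key(criterion, general)
examinee = {
    "buying merchandise for a store": "dislike",
    "roller skating": "indifferent",
    "chemistry": "like",
}
print(key)
print(score(examinee, key))
```

In this sketch the chemistry item is dropped because the two groups answer it almost identically, whereas the merchandise item is keyed in the dislike direction.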

The first SVIB consisted of 420 items and a 
mere handful of occupational scales (Strong, 
1927). Separate editions for men and women fol- 
lowed shortly. The inventory has undergone nu- 
merous revisions over the years (Tzeng, 1987), 
culminating in the modern instrument known as 
the Strong Interest Inventory (Campbell, 1974; 
Hansen, 1992; Hansen & Campbell, 1985). 

Although the Strong Interest Inventory (SII) 
was fashioned according to the same philosophy as 
the SVIB, the latest revision departs from its pre- 
decessors in three crucial ways: 


1. The SII merges the men’s and women’s forms 
into a single edition. 

2. The SII introduces a theoretical framework to 
guide the organization and interpretation of 
scores, as discussed later. 

3. The SII incorporates a substantial increase in the 
number of occupational scales, particularly in 
the vocational/technical areas underrepresented 
in the SVIB. 


The SII consists of 317 items grouped into
seven sections. In the first five sections, the
examinee records “Like,” “Indifferent,” or “Dislike” for
occupations, school subjects, activities, leisure
activities, and contact with different types of persons
(Table 12.2). A sixth part requires the examinee to
express a preference between paired items (e.g.,
dealing with things versus dealing with people).
The seventh section consists of self-descriptive
statements which the examinee marks “Yes,” “No,”
or “?.”


TABLE 12.2 Characteristic Items from the
Strong Interest Inventory


Mark Like, Indifferent, or Dislike next to the following
items.
1. Driving a truck
2. Being a fish and game officer
3. Chemistry
4. Doing applied research
5. Acting in a drama
6. Magazines about music
7. Sociology
8. Fundraising for charities
9. Buying goods for a store
10. People who are leaders
11. Regular work hours
12. Assertive people


The SII can only be scored by prepaid answer 
sheets or booklets that are mailed or faxed to the 
publisher, or through purchase of a software system 
that provides on-site scoring for immediate results. 
The results consist of a lengthy printout that is or- 
ganized according to several themes. All scores are 
expressed as standard scores with a mean of 50 and 
an SD of 10. Normative results for men and women 
are reported separately, but cross-sex comparisons 
can be achieved by simple visual transposition. 
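
The conversion to standard scores with a mean of 50 and an SD of 10 is a linear transformation of the raw score. A minimal Python sketch follows; the raw score and the normative means and SDs are invented for illustration and are not taken from the SII norms.

```python
def standard_score(raw, norm_mean, norm_sd):
    """Rescale a raw score to a standard score with mean 50 and SD 10."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


# Hypothetical norms: the same raw score of 30 evaluated against two
# different reference groups yields two different standard scores.
score_vs_group_1 = standard_score(30, norm_mean=24, norm_sd=6)
score_vs_group_2 = standard_score(30, norm_mean=27, norm_sd=6)

print(score_vs_group_1)
print(score_vs_group_2)
```

The example shows why the choice of normative group matters: a raw score one SD above the mean of one group may be only half an SD above the mean of another.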

At the most global level are the six General 
Occupational Theme Scores, namely, Realistic, In- 
vestigative, Artistic, Social, Enterprising, and Con- 
ventional. These theme scores were based upon the 
theoretical analysis of Holland (1966, 1985ab), 
whose work we discuss later. Each theme score per- 
tains to a major interest area that describes both a 
work environment and a type of person. For exam- 
ple, persons scoring high on the Realistic theme are 
generally quite robust, have difficulty expressing
their feelings, and prefer to work outdoors with 
heavy machinery. Within the theme scores can be 
found 25 Basic Interest Scales such as Adventure, 
Mathematics, and Social Science. The interest 
scales are empirically derived and consist of sub- 
stantially intercorrelated items. 

The most specific results consist of 211 scores 
for the Occupational Scales. In the 1985 revision of 
the SII, these scales were constructed in the usual 
manner by comparing responses of persons em- 
ployed in the given occupation versus samples of 
men-in-general and women-in-general (Hansen, 
1992; Hansen & Campbell, 1985). Sample sizes for 
the criterion groups ranged from 60 to 420, with 
most groups containing 200 or more persons. The 
criterion groups consisted of persons between the 
ages of 25 and 60 years, satisfied with their occu- 
pation, meeting certain minimum standards of suc- 
cessful employment, and employed in the given 
occupation for at least three years. Standardization 
of the 1985 version involved the testing of over 
140,000 persons, of whom only 50,000 met the cri- 
teria for scale development. 

A recent innovation on the SII is the addition of 
personal style scales (Harmon, Hansen, Borgen, & 
Hammer, 1994). These are designed to measure 
preferences for broad styles of living and working. 
These scales assist in vocational guidance by show- 
ing level of comfort with distinctive styles. The four 
style scales are 


1. Work Style, on which a high score indicates a 
preference to work with people and a low score 
signifies an interest in ideas, data, and things; 

2. Learning Environment, on which a high score 
indicates a preference for academic learning en- 
vironments and a low score indicates a prefer- 
ence for more applied learning activities; 

3. Leadership Style, on which a high score indi- 
cates comfort in taking charge of others and a 
low score indicates uneasiness; and 

4. Risk Taking/Adventure, on which a high score 
indicates a preference for risky and adventurous 
activities as opposed to safe and predictable 
activities. 


The personal style scales each have a mean of 50 
and a standard deviation of 10. Note that these are 
truly bipolar scales for which each pole is distinct 
and meaningful. 


Evaluation of the SII 


The SII represents the culmination of over 50 years 
of study, involving literally thousands of research 
reports and hundreds of thousands of respondents. 
In evaluating this instrument, we can only outline 
basic trends in the research, referring the reader to 
other sources for details (Savickas, Taber, & 
Spokane, 2002; Tzeng, 1987; Campbell & Hansen, 
1981; Hansen, 1984, 1987, 1992; Hansen & Camp- 
bell, 1985). We should also point out that evalua- 
tions of the reliability and validity of the SII are 
based in part upon its similarity to the SVIB, for 
which a huge amount of technical data exists. 

Based upon test-retest studies, the reliability of 
the SII-SVIB has proved to be exceptionally good 
in the short run, with one- and two-week stability 
coefficients for the occupational scales generally in 
the .90s. When the test-retest interval is years or 
decades, the correlations drop to the .60s and .70s 
for the occupational scales, except for respondents 
who were older (over age 25) upon first testing. For 
younger respondents first tested as adolescents, the 
median test-retest correlation after 15 years is 
around .50 (Lubinski, Benbow, & Ryan, 1995). But 
for older respondents, first tested after the age of 
25, the median test-retest correlation 10 to 20 years 
later is a phenomenal .80 (Campbell, 1971). Ap- 
parently, by the time we pass through young adult- 
hood, personal interests become extremely stable. 
The questions on the SII-SVIB capture that stabil- 
ity in the occupational scores, providing support for 
the trait conception of personality upon which these 
instruments were based. 
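
A test-retest stability coefficient is simply the Pearson correlation between scores from the two administrations. The following Python sketch computes it from first principles; the six pairs of occupational scale scores are fabricated for illustration and do not come from any SII-SVIB study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical standard scores for six examinees on one occupational
# scale, at first testing and at retest.
first_testing = [42, 55, 61, 48, 50, 67]
retest = [45, 53, 63, 47, 52, 64]

stability = pearson_r(first_testing, retest)
print(round(stability, 2))
```

A coefficient near 1.0 indicates that examinees keep nearly the same relative standing on the scale from one testing to the next.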

The validity of the SII-SVIB is premised 
largely on the ability of the initial occupational pro- 
file to predict the occupation eventually pursued. 
Strong (1955) reported that the chances were about 
two in three that people would be in occupations 
predicted by high occupational scale scores, and 
about one in five that respondents would be in
occupations for which they had shown little interest
when tested. Although other researchers have quib- 
bled with the exact proportions (Dolliver, Irvin, & 
Bigley, 1972), it is clear that the SII-SVIB has im- 
pressive hit rates in predicting occupational entry. 
The instrument functions even better in predicting 
the occupations that an examinee will not enter. In 
a recent study, Donnay and Borgen (1996) provide 
evidence for construct validity by demonstrating 
strong overall differentiation between 50 occupa- 
tional groups on the SII: 


The big picture is that people in diverse occupa- 
tions show large and predictable differences in 
likes and dislikes, whether in terms of vocational 
interests or in terms of personal styles. And the 
Strong provides valid, structural, and comprehen- 
sive measures of these differences. (p. 290) 


The SII is used mainly with high school and col- 
lege students and adults seeking vocational guid- 
ance or advice on continued education. Because 
most students’ interests are undeveloped and unsta- 
bilized prior to age 13 or 14, the SII is not recom- 
mended for use below high-school level. As evident 
in the reliability data reported, the SII becomes in- 
creasingly valuable with older subjects, and it is not 
unusual to see middle-aged persons use the results 
of this instrument for guidance in career change. 


Jackson Vocational Interest Survey (JVIS) 


The Jackson Vocational Interest Survey (JVIS) is a 
relatively new instrument that contrasts sharply in 
several respects with the SII (Jackson, 1977; Ver- 
hoeve, 1993). The 34 basic interest scales on the 
JVIS are composed of two different types, work 
role scales and work style scales. The 26 work role 
scales measure specific interests pertinent to broad 
occupational themes such as mathematics, life sci- 
ence, adventure, business, and teaching. The 8 
work style scales were designed to measure prefer- 
ences for working in environments that require par- 
ticular modes of behavior, such as job security, 
dominant leadership, accountability, and stamina. 
The JVIS may be hand scored, but computer
scoring is probably preferable since the user then
obtains several additional groups of scales, including
data on examinees’ similarity to college students 
majoring in specific academic disciplines. The 
JVIS is suited to high school age and older. 

Several features distinguish the JVIS from the 
SII and other interest inventories. First, the JVIS 
employs a forced-choice ipsative format whereby 
examinees must select their preferred choice from 
two alternatives. Items on the JVIS resemble the 
following: 


A. Acting in a school drama.
B. Teaching kids how to write.


A. Quilting bedspreads with ornate designs.
B. Buying furniture for a chain of stores.


A. Writing a mathematics text for grade school
children.
B. Studying the financial growth of a local
bank.


Although rarely used, the forced-choice item
format has the advantage of reducing the impact of
social desirability upon test results. A second dis- 
tinctive feature of the JVIS is that Jackson used a 
rational and theory-guided method in the derivation 
of scales, as opposed to the empirical approach 
found in most other instruments. As a result of 
these two features, the JVIS scales possess a greater 
independence from one another than found on other 
instruments and are also quite factorially pure. As 
evidence of factorial homogeneity of the scales, 
biserial correlations between item endorsements 
and scale scores are typically in the high .60s and 
low .70s. 

The JVIS is normed on a very large sample— 
approximately 8,000 high school and college stu- 
dents. However, these subjects consist mainly of 
students from Pennsylvania and the Province of 
Ontario, so their representation of the general pop- 
ulation is questionable. Reliability is excellent, at 
least in the short range, with one- to two-week test- 
retest coefficients typically in the mid-.80s. Based 
on an eclectic group of studies reported in the man- 
ual, concurrent and predictive validity appear 
promising, but additional studies are needed to bol- 
ster confidence in this instrument (Shepard, 1989). 




Kuder General Interest Survey 


The Kuder General Interest Survey (KGIS) repre- 
sents the most recent evolution of a series of highly 
respected Kuder vocational interest inventories 
developed over the last 50 years. The first of these 
instruments, the Kuder Preference Record, was 
published in 1939. This instrument introduced an 
interesting forced-choice response format that has 
survived into the present (discussed later). The 
Preference Record underwent several revisions and 
emerged in 1979 as the Kuder Occupational Inter- 
est Survey-Revised (KOIS-R; Kuder & Diamond, 
1979). The KOIS-R is a well-known test that pro- 
duces scores for over 100 specific occupational 
groups and nearly 50 college majors. The target 
population for the KOIS-R is roughly the same as 
for the SII and the JVIS. For purposes of presenting
a diversity of interest tests, it is more instructive
to discuss the KGIS here.

The KGIS is unique among interest inventories 
in that its target population is restricted to adoles- 
cents in grades six through twelve (Kuder, 1975). 
The test requires only a sixth-grade reading level and 
may be administered by the classroom teacher and 
hand scored on site. Thus, the KGIS is well suited to 
the development of educational and vocational goals 
in the early formative years of adolescence. 

The KGIS is also unusual in its methodology: 
The inventory uses a forced-response triad format 
to measure interests. Specifically, each item on the 
test requires the examinee to indicate most- and 
least-liked alternatives from three statements. This 
forced-choice approach is particularly suited to 
identifying examinees who have not answered the 
items sincerely. 

The 168-item inventory produces 10 interest 
scores that are largely ipsative in nature. The reader 
will recall that scores on an ipsative test reflect in- 
traindividual variability rather than interindividual 
variability. With the KGIS, comparison to an
external reference group is of secondary importance
in determining scores.¹ Thus, a high score in one
interest area mainly means that the examinee
preferred that area more often than the others in the
forced-choice items.


¹ In truth, the KGIS is only partially ipsative. Some of the item
triads are scored for more than one scale.

The 10 scales reflect broad areas of interest: 
Outdoor, Mechanical, Computational, Scientific, 
Persuasive, Artistic, Literary, Musical, Social 
Service, and Clerical. An eleventh scale, the Verifi- 
cation Scale, is designed to determine the sincerity 
of the responses. The manual reports extensive test- 
retest, internal consistency, and stability data based 
on a sample of 9,819 students in grades 6 through 
12. The six-week test-retest and internal consis- 
tency data are generally acceptable, with the older 
students showing higher test-retest correlations. 
The possible exception to good reliability is the Per- 
suasive Scale (pertinent to sales positions), which 
shows test-retest correlations of .69 and .73, re- 
spectively, for boys and girls in grades 6 through 8. 

Stability data over a four-year follow-up are 
less impressive. The mean stability coefficient is 
only .50, and for low-IQ subjects (below 100) it is 
even lower, as low as .19 for the Clerical Scale. This 
is unfortunate because low-IQ adolescents would 
be more likely to enter clerical fields than high-IQ 
adolescents. Yet, measurement of clerical interests 
is highly unstable for precisely this group. 


Comment on the KGIS and 
Other Interest Inventories 


Considering the difficulty of the task it under- 
takes—measuring the broad interest patterns of 
adolescents—the KGIS performs at an acceptable 
level. In grades 6 through 8, results of the KGIS 
may spur students to explore new experiences per- 
tinent to their measured interests; in grades 9 and 
10, results may help students plan high school 
courses; and in grades 11 and 12, the results can 
help students make tentative vocational choices. 

But the KGIS suffers the same pivotal short- 
coming of all existing interest inventories, a total 
inattention to opportunity. Williams and Williams 
(1985) have expressed this point well: 


For those specifically looking for a measure of
interest, the Kuder is definitely an acceptable
measure. But interest is only one prong in the
triumvirate of interest-ability-opportunity. The
most important prong, opportunity, has generated 
the least psychometric interest. That this would be 
so is not surprising. Opportunity is by far the hard- 
est construct to define, but those who deal in career 
counseling should never ignore it, regardless of the 
difficulty in measurement and definition. 


We remind the reader that the inattention to oppor- 
tunity is common to all interest measures, although 
it is perhaps a more serious problem for the KGIS 
because this instrument is used with persons who 
have not yet entered the job market. 


Vocational Preference Inventory 


The Vocational Preference Inventory is an objec- 
tive, paper-and-pencil personality interest inventory 
used in vocational and career assessment (Holland, 
1985c). The VPI measures eleven dimensions, in- 
cluding the six personality-environment themes of 
Realistic, Investigative, Artistic, Social, Enterpris- 
ing, and Conventional, and five additional dimen- 
sions of Self-Control, Masculinity/Femininity, 
Status, Infrequency, and Acquiescence. The test 
items consist of 160 occupational titles toward 
which the examinee expresses a feeling by mark- 
ing y (yes) or n (no). The VPI is a brief test (15 to 
30 minutes) and is intended for persons 14 years 
and older with normal intelligence. 

Holland proposes that personality traits tend to 
cluster into a small number of vocationally relevant 
patterns, called types. For each personality type 
there is also a corresponding work environment 
best suited to that type. According to Holland, there 
are six types: Realistic, Investigative, Artistic, So- 
cial, Enterprising, and Conventional. This is some- 
times known as the RIASEC model, in reference 
to the first letters of the six types. The types are ide- 
alizations that few people (or environments) fit 
completely. Nonetheless, Holland believes that 
most individuals tend to resemble one type more 
than the others. In addition, individuals show a 
lesser degree of resemblance to a second and third 
type as well. 

We can summarize the personality-environ- 
ment types as follows: 


• Realistic: athletic, lacks verbal and interpersonal
skills, and prefers “hands-on” or outdoors
vocations such as mechanic, farmer, or electrician

• Investigative: task-oriented thinker with
unconventional attitudes who fits well in scientific and
scholarly positions such as chemist, physicist, or
biologist

• Artistic: individualistic, avoids conventional
situations, and prefers aesthetic pursuits

• Social: uses social competencies to solve
problems, likes to help others, and prefers teaching or
helping professions

• Enterprising: a leader with good selling skills
who fits well in business and managerial positions

• Conventional: conforming and prefers structured
roles such as bank teller or computer operator


The six themes in the RIASEC system can be 
arranged in a hexagon with similar themes side by 
side and dissimilar themes opposite one another, as 
depicted in Figure 12.1. 
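
The hexagonal arrangement implies a simple measure of similarity between themes: the number of steps separating them around the hexagon. A minimal Python sketch, assuming the themes are placed in the order R-I-A-S-E-C; the function name hexagon_distance is hypothetical.

```python
RIASEC = "RIASEC"  # assumed order of the themes around the hexagon


def hexagon_distance(theme_a, theme_b):
    """Steps between two themes: 0 = same, 1 = adjacent, 3 = opposite."""
    step = abs(RIASEC.index(theme_a) - RIASEC.index(theme_b))
    # The hexagon is circular, so take the shorter way around.
    return min(step, len(RIASEC) - step)


print(hexagon_distance("R", "I"))  # adjacent themes
print(hexagon_distance("R", "C"))  # also adjacent, closing the hexagon
print(hexagon_distance("R", "S"))  # opposite themes
```

Under this scheme, Realistic and Social sit at opposite corners and are therefore treated as the most dissimilar pair.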

Test-retest reliability coefficients for the six
major scales range from .89 to .97. VPI norms are
based upon large convenience samples of college
students and employed adults from earlier VPI
editions. The characteristics of the standardization
sample are not well defined, which makes the
norms somewhat difficult to interpret (Rounds,
1985).


FIGURE 12.1 Holland’s Hexagonal Model
of Occupational Themes

Source: Reprinted with permission from Lowman, R. L. (1991).
The clinical practice of career assessment: Interests, abilities,
and personality. Washington, DC: American Psychological
Association. Copyright © 1991 by the American Psychological
Association. Reprinted with permission.

The validity of the VPI is essentially tied to the 
validity of Holland’s (1985a) hexagonal model of 
vocational interests. Literally hundreds of studies 
have examined this model from different perspec- 
tives. We will cite trends and representative studies. 
The reader is referred to Holland (1985c) and 
Walsh and Holland (1992) for more details. 

Several VPI studies have investigated a key as- 
sumption of Holland’s theory—that individuals 
tend to move toward environments that are congru- 
ent with their personality types. If this assumption 
is correct, then the real-world match between work 
environments and personality types of employees 
should be substantial. We should expect to find that 
Realistic environments have mainly Realistic em- 
ployees, Social environments have mainly Social 
employees, and so on. Research on this topic has 
followed a straightforward methodology: Subjects 
are tested with the VPI and classified by their Hol- 
land types (using up to six letters); the work envi- 
ronments of the subjects are then independently 
classified by an appropriate environmental mea- 
sure; finally, the degree of congruence between 
persons and environments is computed. In better 
studies, a correction for chance agreement is also 
applied. 
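
The congruence computation, including a correction for chance agreement, can be sketched in Python. We assume here that the correction is Cohen's kappa applied to first-letter Holland types; the studies under discussion may have used other indices, and the ten person-environment pairs are invented for illustration.

```python
from collections import Counter


def cohen_kappa(person_types, environment_types):
    """Agreement between two classifications, corrected for chance."""
    n = len(person_types)
    observed = sum(p == e for p, e in zip(person_types, environment_types)) / n
    person_counts = Counter(person_types)
    environment_counts = Counter(environment_types)
    # Agreement expected if persons and environments were paired at random.
    expected = sum(
        person_counts[t] * environment_counts[t] for t in person_counts
    ) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical first-letter Holland types for ten employees and for the
# independently classified work environments they occupy.
persons = list("RRRSSSEEII")
environments = list("RRSSSSEEIR")

raw_agreement = sum(p == e for p, e in zip(persons, environments)) / len(persons)
kappa = cohen_kappa(persons, environments)
print(raw_agreement)
print(round(kappa, 2))
```

The corrected index is lower than the raw proportion of matches because some agreement would occur even if persons were assigned to environments at random.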

Using his hexagonal model, Holland has devel- 
oped occupational codes as a basis for classifying 
work environments (Gottfredson & Holland, 1989; 
Holland, 1966, 1978, 1985c). For example, land- 
scape architect is coded as RIA (Realistic, Inves- 
tigative, Artistic) because this occupation is known 
to be a technical, skilled trade (Realistic compo- 
nent) that requires scientific skills (Investigative 
component) and also demands artistic aptitude 
(Artistic component). The Realistic component is 
listed first because it is the most important for 
landscape architect, whereas the Investigative and 
Artistic components are of secondary and tertiary 
importance, respectively. Some other occupations 
and their codes are taxi driver (RSE), mathematics 
teacher (ISC), reporter (ASE), police officer (SRE), 
real estate appraiser (ECS), and secretary (CSA). 


In a similar manner, Holland has also worked out 
codes for different college majors. 

One approach to congruence studies is to com- 
pare VPI results of students or workers with the 
Holland codes that correspond to their college ma- 
jors or occupations. For example, VPI Holland 
codes for a sample of police officers should consist 
mainly of profiles that begin with S and should con- 
tain a larger-than-chance proportion of specifically 
SRE profiles. Furthermore, the degree of congru- 
ence should be related to the degree of expressed 
satisfaction with that line of work or study. 

Research with college students provides strong 
support for the congruence prediction: Students 
tend to select and enter college majors that are 
congruent with their primary personality types 
(Holland, 1985a; Walsh & Holland, 1992). Thus, 
Artistic types tend to major in art, Investigative 
types tend to major in biology, and Enterprising 
types tend to major in business, to cite just a few ex- 
amples. These results provide strong support for the 
VPI and the theory upon which it is based. 

This short review has barely touched the sur- 
face of supportive validity studies with the VPI. 
Walsh and Holland (1992) cite several additional 
lines of research that buttress the validity of this 
test. But not all studies of the VPI affirm its
validity. Furnham, Toop, Lewis, and Fisher (1995) failed
to find a relationship between person-environment 
(P-E) “fit” and job satisfaction, a key theoretical 
underpinning of the test. According to Holland’s 
theory, the better the P-E fit, the greater should be 
job satisfaction. In three British samples, the rela- 
tionships were weak or nonexistent, suggesting that 
the VPI does not “travel well” in cultures outside of 
the United States. 

Although we have emphasized mainly the 
strengths of the VPI up to this point, even the au- 
thors of the test acknowledge that there is room for 
improvement. For example, Walsh and Holland 
(1992) cite the following weaknesses of the VPI: 
(1) the notions about vocational environments are
only partially tested; (2) the hypotheses about the
person-environment interactions need considerable
additional research work; (3) the formulations
about personal development have received some
support but need more comprehensive examinations;
(4) the classification of occupations may differ
depending on the device used to assess the
personality types; and (5) there are personal and
environmental contingencies that are currently
outside the scope of the theory.

The last weakness is perhaps the most serious. 
After all, the VPI assessment approach does not 
currently recognize any role for education, intelli- 
gence, and special aptitudes, except insofar as these 
factors might indirectly bear upon personality and 
vocational interests. Yet, common sense dictates 
that intellectual ability will have a great deal to do 
with vocational satisfaction for some professions, 
independent of the match between personality type 
and work environment. For further discussion of the 
VPI and the theory upon which it is based, the in- 
terested reader is referred to Gottfredson (1990), 
Holland (1990), and Holland and Gottfredson 
(1990). 


Self-Directed Search 


Holland has always shown a keen interest in the 
practical applications of his research on vocational 
development. Consistent with this interest, he de- 
veloped the Self-Directed Search, a highly practi- 
cal, brief test that is appealing in its simplicity 
(Holland, 1985ab). As the name suggests, the 
Self-Directed Search is designed to be a self- 
administered, self-scored, and self-interpreted test 
of vocational interest. The SDS measures the six 
RIASEC vocational themes described previously. 

The SDS consists of dichotomous items that the 
examinee marks “like” or “dislike” (or “yes” or 
“no”) in four sections: (1) Activities (six scales of 
11 items each); (2) Competencies (six scales of 11 
items each); (3) Occupations (six scales of 14 items 
each); and (4) Self-Estimates (two sets of six rat- 
ings). For each section, the face-valid items are 
grouped by RIASEC themes. For each theme, the 
total number of “like” and “yes” answers is com- 
bined with the self-estimates of ability to come up 
with a total theme score. The SDS takes 30 to 50 
minutes for completion and is intended for persons 
15 years and older. 


The RIASEC themes on the SDS showed test- 
retest reliabilities that range from .56 to .95 and in- 
ternal consistencies that range from .70 to .93. 
Norms for SDS scales and codes are reported for 
pooled convenience samples of 4,675 high school 
students, 3,355 college students, and 4,250 em- 
ployed adults ages 16 through 24 (Holland, 1985ab). 
However, SDS results are typically interpreted in an 
individualized, ipsative manner (“Is this occupation 
a good fit for this client?”), so normative data are of 
limited relevance. 

The SDS is available in a hand-scored paper- 
and-pencil version and a computerized version 
as well. Unfortunately, the paper-and-pencil ver- 
sion is prone to a 16 percent clerical error rate 
when used by high school students (Holland, 
1985ab). The user-friendly microcomputer test is 
probably the preferred version because of the ease 
of administration and the error-free scoring and 
interpretation. 

When a subject takes the SDS, the three high- 
est theme scores are used to denote a summary 
code. For example, a person whose three highest 
scores were on Investigative, Artistic, and Realistic 
would have a summary code of IAR. In a separate 
booklet distributed with the test—the Occupations 
Finder—the examinee can look up his or her sum- 
mary code and find a list of occupations that pro- 
vide the best “fit.” For example, an examinee with 
an IAR summary code would learn that he or she 
most closely resembles persons in the following oc- 
cupations: anthropologist, astronomer, chemist, 
pathologist, and physicist. The test booklet contains 
additional information which helps the examinee 
explore relevant career options. 
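
The derivation of a summary code can be illustrated with a short sketch. The theme totals below are invented for illustration, and the tie-breaking rule (tied themes keep their listed order) is an assumption of this sketch rather than a documented SDS scoring rule.

```python
def summary_code(theme_scores):
    """Return the three-letter summary code for a dict of RIASEC theme scores."""
    # Rank themes from highest to lowest total score. sorted() is stable,
    # so tied themes keep their dictionary order (an assumption of this sketch).
    ranked = sorted(theme_scores, key=theme_scores.get, reverse=True)
    # The code is formed from the first letters of the three highest themes.
    return "".join(theme[0] for theme in ranked[:3])


# Hypothetical theme totals for one examinee.
example_scores = {
    "Realistic": 30, "Investigative": 41, "Artistic": 35,
    "Social": 18, "Enterprising": 12, "Conventional": 20,
}
print(summary_code(example_scores))  # IAR
```

The resulting code is what the examinee would then look up in the Occupations Finder.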

The SDS serves a very useful purpose in pro- 
viding a quick and simple format for prompting 
young persons to examine career alternatives. By 
eliminating the time-consuming process of admin- 
istration, scoring, interpretation, and counselor feed- 
back, the test makes it possible for a wide audience 
to receive an introductory level of career counseling. 
Holland (1985ab) proposes that the SDS is appro- 
priate for up to 50 percent of students and adults who 
might desire career guidance. Presumably, the other 
50 percent would find the SDS an insufficient basis 
for career exploration. Holland (1985ab) rightfully
warns users to consider many sources of informa- 
tion in career choice and not to rely too heavily on 
test scores per se. Levinson (1990) discusses the in- 
tegration of SDS data with other psychoeducational 
data to make specific vocational recommendations 
for high school students. 

The validity of the SDS is linked to the validity 
of the hexagonal model of personality and envi- 
ronments upon which the test is based. One aspect 
of validity, then, is whether the model makes pre- 
dictions which are confirmed by SDS results in the 
real world. In general, the results from over 400 
studies support the construct validity of the SDS 
(Dumenci, 1995; Holland, 1985ab, 1987). 

One approach to construct validity is to deter- 
mine whether the relationships between SDS scales 
make theoretical sense. As is true of the VPI, the 
six RIASEC themes of the SDS can be arranged in 
a hexagon with similar themes side by side and dis- 
similar themes opposite one another. For example, 
in Figure 12.2, Artistic and Investigative themes are 
adjacent. It is not difficult to imagine one person 
combining these two themes in personality and 


FIGURE 12.2 SDS Hexagonal Model and Correlations between Scales for 175 Women Ages 26-65

Source: Adapted and reproduced by special 
permission of the publisher from the Self- 
Directed Search Professional Manual by 
John L. Holland, Ph.D. Copyright © 1985 
by PAR, Inc. All rights reserved. 


work environment, so we would predict a moder- 
ate positive correlation between them. In a general 
reference sample of 175 women ages 26 to 65 
years, Holland (1985ab) found that scores on these 
two themes correlated modestly, r = .26, as would 
be expected. The reader will also notice that the In- 
vestigative and Enterprising themes are opposite 
one another, signifying the huge disparities in these 
two occupational motifs. These themes should be 
uncorrelated. In fact, scores on these two themes 
correlated very little, r = -.02. In general, the
correlations found in Figure 12.2 make theoretical
sense; these findings support the construct validity 
of the SDS. 

The predictive validity of the SDS has been 
investigated in several dozen studies, which are 
summarized by Holland (1985ab, 1987). The typical
methodology for these studies is that SDS
high-point codes for large samples of students 
are compared with the first letter of their occupa- 
tional choices (or aspirations) one to three years 
later. Overall, the findings indicate that the SDS 
has moderate to high predictive efficiency, de- 
pending upon the age of the sample (hit rates go 
up with age), the length of the time interval (hit
rates go down with time), and the specific category 
predicted (hit rates are better for Investigative and 
Social predictions) (Gottfredson & Holland, 
1975). 
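
A hit rate of the kind described here can be sketched as follows. The function name and the sample data are hypothetical; the only idea taken from the text is that a "hit" occurs when the SDS high-point letter matches the first letter of the later occupational code.

```python
def hit_rate(high_point_codes, later_choice_codes):
    """Proportion of cases in which the SDS high-point letter equals the
    first letter of the later occupational choice code."""
    pairs = list(zip(high_point_codes, later_choice_codes))
    hits = sum(1 for predicted, chosen in pairs if predicted == chosen[0])
    return hits / len(pairs)


# Invented data: four students' high-point codes and their later choices.
predicted = ["I", "S", "R", "A"]
chosen = ["IAR", "SEC", "CRE", "ASE"]
print(hit_rate(predicted, chosen))  # 0.75
```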

Correlations between SDS scales and a wide 
range of other psychological measures (e.g., per- 
sonality, aptitudes, and values) also serve to define 
the meaning of SDS scales and therefore help to 
validate the test (Holland, 1985ab, 1987). For ex- 
ample, a study by Costa, McCrae, and Holland 
(1984) investigated the relationship between SDS 
scales and the NEO Personality Inventory for a 
sample of 217 men and 144 women ages 21 to 89 
years. The Investigative and Artistic scales from the 
SDS showed strong positive correlations with the 
NEO Openness scale—a measure of openness to 
experience in the areas of fantasy, feelings, actions, 
and ideas. The Social and Enterprising scales from 
the SDS showed strong correlations with the NEO 
Extraversion scale—a measure of outward-direct- 
ness and sociability. The Realistic and Conven- 
tional scales from the SDS revealed only trivial 
correlations with the NEO scales. Overall, the ob- 
served correlations were consistent with the inter- 
pretation of the I, A, S, and E scales and provide 
good support for their validity. Although results for 
this study failed to support the validity of the R and 
C scales, many other investigations have yielded 
confirmatory findings (Holland, 1985ab). Schinka, 
Dye, and Curtiss (1997) provide a thoughtful 
analysis and discussion of the relationship between 
NEO dimensions and the SDS scales. 


Campbell Interest and Skill Survey 


The Campbell Interest and Skill Survey (CISS; 
Campbell, Hyne, & Nilsen, 1992) is a newer mea- 
sure of self-reported interests and skills. The test is 
designed to help individuals make better career 
choices by describing how their interests and skills 
match the occupational world. The primary target 
population for the CISS is students and young 
adults who have not entered the job market, but the 
test is also suitable for older workers who are considering
a change in careers. The test is appropriate
for persons 15 years of age and older with a sixth-grade
reading level, although younger children can
be tested in exceptional circumstances. 

The CISS consists of 200 interest items and 120 
skill items. The interest items include occupations, 
school subjects, and varied working activities that 
the examinee rates on a six-point scale from 
strongly like to strongly dislike. The interest items 
resemble the following: 


A pilot, flying commercial aircraft 
A biologist, working in a research lab 
A police detective, solving crimes 


The skill items include a list of activities that the ex- 
aminee rates on a six-point scale from expert 
(widely recognized as excellent in this area) to none 
(have no skills in this area). The skill items resem- 
ble the following: 


Helping a family resolve its conflicts 

Making furniture, using woodworking and 
power tools 

Writing a magazine story 


CISS results are scored on several different kinds 
of scales: Orientation Scales, Basic Interest and 
Skill Scales, Occupational Scales, Special Scales, 
and Procedural Checks. All scale scores are re- 
ported as T scores, normed to a population average 
of 50, with a standard deviation of 10. 
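
As a reminder of what a T score expresses, the standard linear conversion can be sketched as below. The raw score, norm mean, and norm standard deviation are invented values; the CISS manual's actual norming procedure is not described here, so this is only the generic T-score definition.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, standard deviation 10)."""
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return 50 + 10 * z


# Hypothetical values: a raw score one standard deviation above the norm mean.
print(t_score(34, 28, 6))  # 60.0
```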

The Orientation Scales serve to organize the 
CISS profile—the interest, skill, and occupational 
scales are reported under the appropriate Orienta- 
tions. The seven Orientations are as follows (Camp- 
bell et al., 1992, pp. 2-3): 


• Influencing: influencing others through leadership,
  politics, public speaking, and marketing

• Organizing: organizing the work of others,
  managing, and monitoring financial performance

• Helping: helping others through teaching, healing,
  and counseling

• Creating: creating artistic, literary, or musical
  productions, and designing products or environments

• aNalyzing: analyzing data, using mathematics,
  and carrying out scientific experiments

• Producing: producing products, using “hands-on”
  skills in farming, construction, and mechanical
  crafts

• Adventuring: adventuring, competing, and risk
  taking through athletic, police, and military
  activities

There are 29 pairs of Basic Scales, each pair con- 
sisting of parallel interest and skill scales. The 
Basic Scales are clustered within the seven Orien- 
tations, based upon their intercorrelations. For 
example, the Helping Orientation contains the fol- 
lowing Basic Scales, each with separate interest 
and skill components: Adult Development, Coun- 
seling, Child Development, Religious Activities, 
and Medical Practice. 

The 58 pairs of Occupational Scales, each with 
separate interest and skill components, provide 
feedback on the degree of similarity between the 
examinee and satisfied workers in that occupation. 
These scales were constructed empirically by con- 
trasting the responses of happily employed persons 
in specific occupations with responses of a general 
reference sample drawn from the working popula- 
tion at large. 

In addition to Basic and Occupational Scales, 
the CISS incorporates three special scales: Aca- 
demic Focus, a measure of interest and confidence 
in intellectual, scientific, and literary activities; Ex- 
traversion, a measure of social extraversion; and 
Variety, a measure of the examinee’s breadth of 
interests and skills. Finally, the CISS reports a va- 
riety of Procedural Checks to detect possible prob- 
lems in test taking such as random responding or 
excessive omissions. 

Overall, the reliability of CISS scales is excep- 
tionally strong. For example, coefficient alpha for 
the Orientation Scales is typically in the high .80s, 
and three-month test-retest reliabilities for 324 re- 
spondents are in the mid- to high .80s. Similar find- 
ings for reliability are reported for the Basic and 
Occupational Scales. Norms for the CISS are based 
upon 5,000 subjects spread over the 58 occupations. 
The authors report extensive validity data for the Occupational
Scales, including sample means for each
occupational sample as well as lists of the three
highest- and lowest-scoring occupations for each scale
(Campbell et al., 1992). These data document that 
the scales do discriminate between occupations in an 
effective and meaningful way. For example, the average
T score of accountants on the Accountant scale is 75.8.
Statisticians, bookkeepers, and financial planners 
achieve the next three highest scores for this scale, 
with average T scores in the low 60s. Commercial 
artists, professors, and social workers obtain the 
three lowest scores, with average T scores around 40. 
Because these results fit well with our expectations 
about occupational interest and skill patterns, they 
provide support for the validity of the CISS. 

Independent correlational studies also support 
the validity of the CISS. For example, in a sample 
of 118 adults, Savickas et al. (2002) correlated 
scores from individual occupational scales of the 
CISS with scores from the scales of other main- 
stream instruments such as the Strong Interest In- 
ventory. They found strong support for both 
convergent validity (i.e., modest correlations for 
same-named pairs of scales) and discriminant va- 
lidity (i.e., negligible correlations for unlike pairs 
of scales). In a sample of 128 college students, 
Hansen and Neuman (1999) confirmed the concur- 
rent validity of the CISS by finding a good fit be- 
tween occupational scale scores and students’ 
chosen majors. The fit was considered “excellent” 
or “moderately good” for more than 70 percent of 
the students. Boggs (1999) provides a review and 
critique of the CISS. Campbell (2002) presents the 
history and development of the instrument. 

This instrument will almost certainly receive 
increased attention in the years ahead. One note- 
worthy feature of the CISS is the comprehensive- 
ness and clarity of the profile report form. The 
report consists of 11 user-friendly pages. We have 
reprinted two pages in Figure 12.3 for illustrative 
purposes. This format is preferable to the detail- 
rich but eye-straining graphs encountered with 
many instruments. The CISS promises to rival the 
Strong Interest Inventory for vocational guidance 
of young adults. 

FIGURE 12.3 Representative Sections from the Campbell Interest and Skill Survey
Source: From Campbell, D. P., Hyne, S., & Nilsen, D. (1992). Manual for the Campbell Interest and Skill Survey. Minneapolis, MN:
National Computer Systems. Copyright © by David Campbell, Ph.D. Reproduced with permission.
Note: The full profile consists of an 11-page printout.

CAMPBELL INTEREST AND SKILL SURVEY INDIVIDUAL PROFILE

The Influencing Orientation covers the general area of leading and influencing others. People who score high are interested in
making things happen. They want to take charge and are willing to accept responsibility for the results. Influencers are generally
confident of their ability to persuade others to their viewpoints, and they enjoy the give and take of verbal jousting. They
typically work in organizations, and often want to take charge of the specific activities that particularly interest them. They enjoy
public speaking, and like to be visible in public. Typical high-scoring occupations include company presidents, corporate
managers, and school superintendents.

You have expressed a strong interest in the area of organizational leadership -- for being in charge and accepting the responsibility
for the outcome. You would probably be comfortable in situations where you were responsible for directing the work of others,
setting organizational policies and motivating people around you.

Further, you have also reported a high level of confidence in your abilities in leading and motivating other people. You would
probably enjoy being in charge of your own department, division, or organization, and are quite confident that you could perform
well. Because both your Interest and Skill scores are high, this is an appealing area for you.

Your scores on the Influencing Basic Scales, which provide more detail about your Interests and Skills in this area, are reported
above on the left-hand side of the page. Your scores on the Influencing Occupational Scales, which show how your pattern of
interests and skills compares with those of people employed in Influencing occupations, are reported above on the right-hand side
of the page. Each occupation has a one- to three-letter code which indicates its highest Orientation score(s). The more similar
the Orientation code is to your highest Orientation scores (which are reported on page 2), the more likely you will find satisfaction
working in that occupation.

*   Standard Scores: I = Interests; S = Skills
**  Interest/Skill Pattern: Pursue = High Interests, High Skills; Develop = High Interests, Lower Skills;
    Explore = High Skills, Lower Interests; Avoid = Low Interests, Low Skills
*** Orientation Code: I=Influencing; O=Organizing; H=Helping; C=Creating; N=aNalyzing; P=Producing; A=Adventuring
    Range of middle 50% of people in the occupation: Solid Bar = Interests; Hollow Bar = Skills


CAREER AND WORK
VALUES ASSESSMENT 


In Working, his monumental discourse about Amer- 
icans on the job, Studs Terkel concluded that work 
is a search 


For daily meaning as well as daily bread, for recog- 
nition as well as cash, for astonishment rather than 
torpor; in short, for a sort of life rather than a Mon- 
day through Friday sort of dying. Perhaps immor- 
tality, too, is part of the quest. To be remembered 
was the wish, spoken and unspoken, of the heroes 
and heroines of this book. (Terkel, 1974) 


People seek meaning in their work. After inter- 
viewing hundreds of workers, Terkel concluded 
that only a lucky few find this meaning. Everyone 
can recall such fulfilled souls: the minister who is 
adored by his flock, the landscaper who proudly 
leaves an enduring legacy, the auto mechanic who 
delights in the perfectly tuned engine, or the oral 
historian who rescues a piece of the past. But con- 
trasted with these few, Terkel discovered that the 
vast majority harbor a “hardly concealed discon- 
tent” about work. 

Whether we agree or disagree with this pes- 
simistic position, it is clear that values play an im- 
portant role in work satisfaction, career choice, and 
career development. This is especially evident when 
a mismatch arises between personal values and the 
dominant values required by a career. In her book on 
career changes, Jones (1980) relates the story of an 
advertiser who came to a painful midlife realization: 
“I disliked the focus of my work. The advertising of 
bad products is damaging the country. ... The 
whole idea of advertising seemed wrong” (p. 27). 
This person was so dissatisfied with his vocation 
that he switched to another field in midlife. Appar- 
ently, he valued service to others—a work value that 
collided head-on with the amoral stance prevalent 
in advertising. We can only wonder how he picked 
such a mismatched career in the first place, but it 
seems unlikely that it was a rational choice based on 
an assessment of his work values. 

It is also worth asking about the source of the 
discontent that Terkel discovered. Is this discontent 
largely unavoidable, inherent to the very nature of 
work? Or does it arise, at least in part, from the millions
of individual mismatches between what a job
offers and what a worker needs? 

In this section we expand upon the theme de- 
veloped in the preceding topic—that career choice 
can be enhanced through the appropriate applica- 
tion of career assessment tools. The reader will first 
encounter individual tests of work values and ca- 
reer development. Next, we discuss the integrative 
career assessment model. In this approach, abili- 
ties, interests, and personality characteristics are in- 
tegrated in vocational guidance. The chapter closes 
on the related topic of consumer assessment. 

Work values refer to needs, motives, and 
values that influence vocational choice, job satis- 
faction, and career development. Even when back- 
ground factors such as intelligence, education, and 
ability are held constant, it is clear that individuals 
differ in their work values. What one person desires 
from his or her work might be positively poisonous 
to another individual of equal intelligence, educa- 
tion, and ability. 

Here is a true story to illustrate this point. On 
behalf of several families, an attorney specializing 
in personal injury filed a lawsuit against a large 
mining corporation. The mining company was ac- 
cused of spewing poisonous lead smelter emissions 
into the air breathed by hundreds of small-town res- 
idents, causing subtle neurological damage to 
dozens of children. The lawsuit involved more than 
20 expert witnesses and dragged on for years. The 
lawyer was deep in debt from financing the pro- 
tracted litigation—he faced bankruptcy if the law- 
suit failed. Yet, he was ecstatic as he approached the 
final showdown in U.S. Federal Court, his entire 
career on the line. For this individual, perilous risk 
taking was a cherished work value. In contrast, 
most persons would actively avoid this kind of 
high-stakes gamble with their careers. 

A proper match between work values and ca- 
reer choice is essential for job satisfaction. Some 
people succeed in finding such a match. For exam- 
ple, the intrepid lawyer mentioned here was well 
suited to his career path. But for those uncertain 
about career choice, feedback about work values 
can provide much-needed guidance. There are 
several assessment tools that might be helpful in
this regard, but three instruments deserve special
mention: the Minnesota Importance Questionnaire,
the Work Values Inventory, and the Values Scale. These
are reviewed in the following sections.


Minnesota Importance Questionnaire 


The Minnesota Importance Questionnaire (MIQ) 
was developed to measure vocational needs and 
values of adults from high school age on up. The 
test has a solid foundation in a theory of work 
adjustment that emphasizes the importance of per- 
son-environment correspondence in determining 
satisfaction from work (Dawis, England, & Lof- 
quist, 1964; Dawis & Lofquist, 1984; Lofquist & 
Dawis, 1991). According to the theory, work satis- 
faction is directly related to the correspondence 
between the worker’s needs and the rewards or re- 
inforcers available from the job. For example, a 
prospective employee who has a strong need to 
help other people will probably find satisfaction in 
a job that provides plentiful opportunities for social 
service; conversely, he or she might be miserable in 
a position that emphasizes solitary work. 

In its most popular form—which consists of 
paired-comparison items—the MIQ measures 20 
needs organized into six underlying values relevant 
to work satisfaction. The test also comes in a ranked
format that we do not discuss here. The six values 
emerged from factor analyses of the needs. The val- 
ues and their component needs are listed in 
Table 12.3. It is important to emphasize that each 
need “scale” actually consists of a single need statement.
For example, the Responsibility need “scale”
actually consists of a single statement resembling 
the following: “Could make my own decisions.” 

The MIQ consists of 210 items. These include 
190 items that pair each of the 20 needs with every 
other need. An additional 20 items require absolute 
judgments of the importance of each need dimension. 
The paired-comparison items are in reference to the 
examinee’s “ideal job” and resemble the following: 


Could give me a sense of accomplishment
OR
Could make my own decisions








TABLE 12.3 Values and Components of the
Minnesota Importance Questionnaire

Values         Components

Achievement    Ability Utilization
               Achievement

Comfort        Activity
               Independence
               Variety
               Compensation
               Security
               Working Conditions

Status         Advancement
               Recognition
               Authority

Altruism       Coworkers
               Social Service
               Moral Feelings

Safety         Company Policies
               Supervision-Human Relations
               Supervision-Technical

Autonomy       Creativity
               Responsibility





The examinee is instructed to select the alternative 
of greater personal importance in a job—hence the 
reference to Importance in the title of the instru- 
ment. In the preceding example, the achievement 
need (Could give me a sense of accomplishment) is 
matched with the responsibility need (Could make 
my own decisions). In order to pair each need with 
every other need, 190 items are required.² Each of
the 20 work needs is also rated individually on an 
absolute scale of importance, which results in a 
total of 210 items. These absolute judgments per- 
mit comparisons across examinees or across scales 
within examinees. 
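
The item counts given above can be checked by enumerating the pairs directly. In this sketch the need labels (need_1 through need_20) are placeholders, not the actual MIQ need statements.

```python
from itertools import combinations

# Placeholder labels standing in for the 20 MIQ need statements.
needs = [f"need_{i}" for i in range(1, 21)]

# Every unordered pair of distinct needs yields one paired-comparison item.
paired_items = list(combinations(needs, 2))
# Each need is also rated once on an absolute scale of importance.
absolute_items = list(needs)

print(len(paired_items))                        # 190
print(len(paired_items) + len(absolute_items))  # 210

# Each single need statement appears in 19 of the paired items.
appearances = sum(1 for pair in paired_items if "need_1" in pair)
print(appearances)                              # 19
```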

The MIQ is interpreted in reference to occu- 
pational reinforcer patterns (ORPs) for nearly 
200 occupations. The ORPs were derived from a 
parallel research program using the Minnesota 
Job Description Questionnaire (MJDQ), a scale 
that resembles the MIQ. The MJDQ requires current
job holders to rate the perceived presence or
absence of reinforcers in a given job. Of course,
these reinforcers are simply the 20 work needs
appropriately restated so as to capture occupational
requirements.

2. The total number of paired comparisons for N statements
(when order is not relevant) can be calculated from the formula
N(N - 1)/2, which is 20(19)/2 or 190.

By comparing the profile of examinee needs 
and values with known reinforcer patterns for rep- 
resentative occupations, MIQ results can be used to 
predict satisfaction in specific jobs. This is done by 
means of the C-Index, or correspondence index, 
which is the correlation coefficient between the in- 
dividual’s MIQ profile and the ORP for each occu- 
pation. Satisfaction is predicted in an occupation 
when the correlation exceeds .50. 
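
Because the C-Index is described as a correlation coefficient, its logic can be sketched with an ordinary Pearson correlation. The five-element profiles below are invented and shortened for readability (a real comparison would span all 20 needs), and the function names are hypothetical; only the .50 cutoff comes from the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def predicts_satisfaction(miq_profile, orp, cutoff=0.50):
    """True when the correspondence between profile and ORP exceeds the cutoff."""
    return pearson_r(miq_profile, orp) > cutoff


# Invented, shortened profiles: one examinee and two occupations.
examinee = [1.2, -0.5, 0.3, 2.0, -1.1]
similar_orp = [1.0, -0.2, 0.1, 1.5, -0.9]
opposite_orp = [-1.0, 0.2, -0.1, -1.5, 0.9]

print(predicts_satisfaction(examinee, similar_orp))   # True
print(predicts_satisfaction(examinee, opposite_orp))  # False
```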

The paired-comparison format of the MIQ per- 
mits the examiner to evaluate the consistency of re- 
sponses. Consider any three needs, designated as A, 
B, and C. Suppose an examinee prefers A to B and 
also prefers B to C. Logically, this person also 
should prefer A to C (transitivity). This is an exam- 
ple of a logically consistent triad (LCT). The LCT 
score is the percentage of all triads that are logically 
consistent. This score provides an index of response 
consistency that is one measure of test-taking va- 
lidity. LCT scores below 33 raise a suspicion that 
the examinee has responded carelessly or randomly. 
In test-retest studies, the higher the LCT the more 
stable the examinee’s MIQ profile. 
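
The idea of a logically consistent triad can be made concrete with a small sketch. It assumes that every pair of needs has exactly one recorded preference, stored as a (preferred, other) tuple; the four labels and the function name are hypothetical, and the MIQ's own scoring software may compute the index differently.

```python
from itertools import combinations


def lct_score(prefers, items):
    """Percentage of triads whose three pairwise choices are transitive.

    prefers: set of (preferred, other) tuples, one for every pair of items.
    """
    consistent = 0
    total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        wins = {a: 0, b: 0, c: 0}
        for x, y in ((a, b), (a, c), (b, c)):
            winner = x if (x, y) in prefers else y
            wins[winner] += 1
        # A circular (inconsistent) triad is one where each item wins once.
        if sorted(wins.values()) != [1, 1, 1]:
            consistent += 1
    return 100 * consistent / total


items = ["A", "B", "C", "D"]
# Fully transitive choices: A over B over C over D.
transitive = {("A", "B"), ("A", "C"), ("A", "D"),
              ("B", "C"), ("B", "D"), ("C", "D")}
# Same choices except that C is preferred to A, creating one circular triad.
one_cycle = (transitive - {("A", "C")}) | {("C", "A")}

print(lct_score(transitive, items))  # 100.0
print(lct_score(one_cycle, items))   # 75.0
```

With four items there are only four triads, so a single circular triad lowers the score to 75; with 20 needs there are 1,140 triads.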

Reliability of the MIQ is fair to excellent, de- 
pending upon the retesting interval. The median test- 
retest correlation for the 20 scales is reported to be 
.89 for immediate retesting, but only .53 for retesting 
after 10 months. Internal consistency reliabilities are 
typically around .80 (Rounds et al., 1981). 

Approximately 200 studies bear upon the va- 
lidity of the MIQ, so it is difficult to summarize 
trends (Layton, 1992). The results indicate that the 
20 MIQ scales discriminate among distinct occu- 
pational groups; that correlations with the Strong 
Vocational Interest Blank are significant and the- 
ory-consistent; and that the MIQ has appropriately 
low correlations with abilities as measured by the 
General Aptitude Test Battery (Benson, 1985; Lay- 
ton, 1992). In an affirming study, scores on the 
MIQ Independence scale moderately predicted 
whether graduate students in counseling psychol- 
ogy would become scientists or practitioners when 
they entered the job market (Tinsley, Tinsley,
Boone, & Shim-Li, 1993). 

The MIQ is a well-respected instrument that de- 
serves to be broadly used. Curiously, the test has 
never really captured wide attention. Perhaps this is 
due to the format of the instrument, which might be 
an impediment to its adoption by human service 
personnel. The problem is that examinees encounter 
the same need statements time and again. In order 
to pair each need with every other need, it is neces- 
sary to use each individual need statement in 19 sep- 
arate test items. Examinees feel like they encounter 
the same questions over and over, even though each 
item on the MIQ is, in fact, unique. Regardless of 
its psychometric soundness, from the standpoint of 
the examinee the MIQ is an unappealing instrument. 


Work Values Inventory 


The Work Values Inventory (WVI) is a short and 
simple instrument designed to measure 15 work 
values in individuals from junior high level through 
high school (Super, 1968, 1970). The test is the end 
product of decades of research on the goals that mo- 
tivate individuals to work. The 15 work values were 
identified through a literature review that included 
the early, classic work of Spranger (1928) on Types 
of Men. Items, scales, and test formats were con- 
tinually revised and refined until the current 5-point 
rating approach was selected (Super, 1970, 1973). 

The WVI is a self-report instrument consisting 
of 45 items rated on a 5-point scale from “Very Im- 
portant” to “Unimportant.” Test items resemble the 
following: “Become famous in your field,” “Make
your own job decisions,” “Feel you have helped
other people,” and “Have a boss who is considerate.”
There are three items for each of the 15 scales.
The 15 work values measured by the test include 
the following: 


Altruism                    Economic Returns
Esthetic                    Security
Creativity                  Surroundings
Intellectual Stimulation    Supervisory Relations
Achievement                 Associates
Independence                Way of Life
Prestige                    Variety
Management

These work values are described in detail in the manual.
For example, Altruism is described as “present
in work that enables one to contribute to the welfare 
of others” and Prestige is described as “associated 
with work that gives one standing in the eyes of oth- 
ers and evokes respect.” The items and scales are 
transparent. For example, the item “Become famous 
in your field” would belong on the Prestige scale. 

Results for the WVI are reported as 15 scale 
raw scores, each ranging from 3 (low score) to 15 
(endorsing “Very Important” for all three items on 
the scale). These scores permit three strategies of 
interpretation. In a clinical analysis, the three high- 
est scores are highlighted for purposes of discus- 
sion by counselor and examinee. The data can also 
be analyzed normatively with respect to results for 
other students of the same age. Most importantly, 
it is also possible to predict satisfaction in various 
occupations by use of occupational reinforcer pat- 
terns (ORPs). The approach is similar to that pre- 
viously discussed for the MIQ, in which the 
counselor determines the degree of match between 
the examinee’s work values and the known rein- 
forcers available in various occupations. 
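
The scoring arithmetic and the clinical strategy (highlighting the three highest scales) can be sketched as follows. This is an illustration only: the assignment of items to scales is hypothetical (the real key is in the WVI manual), and the example ratings are invented.

```python
SCALES = [
    "Altruism", "Esthetic", "Creativity", "Intellectual Stimulation",
    "Achievement", "Independence", "Prestige", "Management",
    "Economic Returns", "Security", "Surroundings", "Supervisory Relations",
    "Associates", "Way of Life", "Variety",
]

def score_wvi(ratings):
    """Sum 45 ratings (1 = Unimportant ... 5 = Very Important) into 15 scale scores."""
    if len(ratings) != 45 or any(r not in range(1, 6) for r in ratings):
        raise ValueError("expected 45 ratings, each from 1 to 5")
    # Hypothetical key: items 3k, 3k+1, 3k+2 belong to scale k.
    return {scale: sum(ratings[3 * k: 3 * k + 3]) for k, scale in enumerate(SCALES)}

def top_three(scores):
    """Return the three highest-scoring scales for discussion with the examinee."""
    return sorted(scores, key=scores.get, reverse=True)[:3]

# Invented example: mostly neutral ratings with three elevated scales.
example = [3] * 45
example[0:3] = [5, 4, 4]     # Altruism items under the hypothetical key
example[6:9] = [4, 4, 4]     # Creativity items
example[18:21] = [5, 5, 5]   # Prestige items
scores = score_wvi(example)
print(top_three(scores))
```

Each scale score necessarily falls between 3 and 15 because it is the sum of three ratings on a 1-to-5 scale.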

The technical aspects of the WVI are com- 
mendable. In a sample of 99 tenth-grade students, 
the instrument revealed two-week test-retest relia- 
bilities in the .80s for all scales except Prestige (r= 
.76). The manual provides extensive normative data 
for junior and senior high school students. Validity 
also looks strong, as judged by correlations with 
other measures of work values, factor analysis of 
scales and items, and theory-confirming relation- 
ships with external criteria. Bolton (1985) provides 
an excellent review of validity evidence for the 
WVI. Perhaps the only cautionary note about the 
WVI is that the instrument is now somewhat dated.
New norms and reliability data need to be provided. 


Values Scale 


The Values Scale (VS) was developed by a consor- 
tium of researchers under the direction of Super 
and Nevill (1986) to assess 21 values relevant to 
work and life roles. The test consists of five items 
per value, each rated from 1 (“Of little or no importance”)
to 4 (“Very important”). A final item is used
when the scale is administered cross-culturally for
a total of 106 items. The values measured by the
VS include the following:

Ability Utilization         Physical Activity
Achievement                 Prestige
Advancement                 Risk
Aesthetics                  Social Interaction
Altruism                    Social Relations
Authority                   Variety
Autonomy                    Working Conditions
Creativity                  Cultural Identity
Economic Rewards            Physical Prowess
Life Style                  Economic Security
Personal Development


An unusual but highly desirable aspect to the 
VS is that the test was developed explicitly for use 
in cross-cultural research. An informal consortium 
of research teams from dozens of countries in 
North America, Europe, Australia, Asia, and Africa 
was involved in the definition, revision, and refine- 
ment of values measured by the test. Each national 
team translated the test into its own language and 
pilot-tested the items. 

The reliability of the VS is only fair, which is 
understandable since the instrument contains only
five items per scale. Alpha coefficients are above 
.70 for all scales, and test-retest reliabilities are 
above .50 in college student samples. Norms are 
provided for a convenience sample of 3,000 U.S. 
students and adults. Initial validity studies are 
promising, but more studies are needed before the 
test is used for individual guidance (Rousseau, 
1989; Slaney, 1989). 
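
For readers unfamiliar with the alpha coefficient cited above, the following sketch computes coefficient alpha for a single five-item scale using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The ratings are invented for illustration and are not VS data.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list of k item ratings per respondent."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))                 # ratings grouped by item
    sum_item_vars = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum_item_vars / total_var)

# Invented ratings on a 1-to-4 scale: six respondents, five items.
ratings = [
    [4, 3, 4, 4, 3],
    [2, 2, 1, 2, 2],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4],
    [2, 3, 2, 2, 3],
]
print(round(cronbach_alpha(ratings), 2))
```

With only five items per scale, alpha is sensitive to any single weak item, which is one reason short scales tend to yield modest reliabilities.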

The Values Scale represents the very best in test 
development ideals. By involving dozens of re- 
search teams around the globe, Super and Nevill 
(1986) have conceived a test with true cross-cul- 
tural appeal and utility. Perhaps their efforts will 
help forge a global perspective on the nature and 
value of work. Too often, test development has 
been a parochial activity restricted to Western in- 
dustrialized cultures. We can only hope that other 
test developers will also value the cross-cultural 
perspective in assessment. 


Assessment of Career Development


Super (1957, 1990) has emphasized that career 
choice is not a discrete decision but a continuing 
process. He argues that vocational development 
is characterized by stages: growth, exploration, 
establishment, maintenance, and decline. In the 
growth stage, an individual entertains fantasies, de- 
velops interests, and discovers his or her capacities. 
During the exploration stage in adolescence and 
early adulthood, the individual engages in tenta- 
tive examination of careers. This is followed by 
stabilization and consolidation of a career in the es- 
tablishment phase. At the age of approximately 50, 
most individuals enter the maintenance stage, char- 
acterized by innovation and updating for some, but 
stagnation and deceleration for others. The decline 
stage features disengagement for most, but career 
specialization for a few. 

In the beginning stages of career development— 
the growth and exploration stages—traditional 
vocational tests may not provide the best kinds 
of guidance information, since they are usually 
founded on the premise that the examinee is knowl- 
edgeable about work and has well-established 
interest patterns. However, it is typical of indivi- 
duals in these stages to have limited information 
about careers and minimal knowledge of their vo- 
cational interests and values. In these situations, 
specialized instruments are needed for effective ca- 
reer assessment. 

Several vocational measures are based upon a 
recognition that career choice is an ongoing pro- 
cess rather than a single decision. These alternative 
instruments focus upon maturity of career knowl- 
edge, vocational planning, and decision-making 
skills. Several representative career development 
and career maturity tests are mentioned briefly 
in Table 12.4. For a more extended discussion, 
the reader is urged to consult Walsh and Betz (1995). 


TABLE 12.4 Representative Measures of Career Development


Career Directions Inventory (Jackson, 2000)
Consisting of 100 triads of statements describing
job-related activities, the examinee marks his/her
most-preferred and least-preferred activity. The 15 basic
interest scales include both work roles and work styles,
e.g., administration, food service, sales, outdoors,
writing, assertive, persuasive, systematic. Excellent
reliability and validity; norms based upon 12,000
individuals from more than 150 educational and
occupational specialties.

Career Beliefs Inventory (Krumboltz, 1999)
Comprised of 96 items rated on a five-point scale from
“strongly agree” to “strongly disagree,” this inventory
is intended to identify client beliefs that may be
blocking his/her career goals. Examples of the 25 scales
include: career plans, acceptance of uncertainty, intrinsic
satisfaction, control, approval of others.

Career Thoughts Inventory (Sampson, Peterson, Lenz,
Reardon, & Saunders, 1998)
Based upon the principles of cognitive therapy, the CTI
is a self-administered, objectively scored measure of
dysfunctional thinking in career problem solving and
decision-making. The 48 items assess decision-making
confusion, commitment anxiety, and external conflict.

Career Development Inventory (Super, Thompson,
Lindeman, and others, 1981)
A comprehensive measure of career development and
maturity, the CDI consists of five subtests: Career
Planning, which measures extent of, and engagement in,
career planning; Career Exploration, which evaluates
current and prospective attempts to obtain career
information; Decision Making, which measures ability to
apply knowledge and insight to career planning; World
of Work Information, a measure of knowledge of
occupational structure; and Knowledge of the Preferred
Occupational Group, which provides an in-depth
assessment of knowledge about the examinee’s single,
preferred occupational group.


INTEGRATIVE MODEL OF CAREER ASSESSMENT


Practitioners of career assessment rarely rely upon
information from a single source such as a survey
of interests or work values. The effective vocational
counselor uses an integrative model in which 
information from interest, ability, and personality 
domains is considered simultaneously. Lowman
(1991) has presented the elements of this approach, 
and much of our discussion is based upon his analy- 
sis. Practitioners of this method do not minimize 
the importance of interests in determining career 
choice and satisfaction. However, they tend to 
assign primary importance to ability patterns and 
certain personality characteristics in vocational 
assessment and guidance. We discuss ability pat- 
terns first. 


Ability Patterns in Career Assessment 


Depending upon the career goals of the client, 
several ability dimensions might be relevant to 
career assessment. A partial list includes general 
intelligence (g), mechanical and physical abilities, 
spatial analysis, verbal intelligence, investigative 
skills, artistic abilities, and social intelligence 
(Gottfredson, 1986; Lowman, 1991). The im- 
portance of broad or primary mental abilities 
such as spatial analysis or verbal intelligence is 
fairly obvious. For example, architecture requires 
high levels of spatial analysis for success. In a 
prospective architect, no amount of interest in the 
field can compensate for low ability in spatial 
analysis. Likewise, verbal abilities are essential for 
journalism and other professions that demand 
language proficiency. The relevance of specialized 
aptitudes such as mechanical abilities (for a 
prospective mechanic) and artistic abilities (for 
a prospective artist) is likewise straightforward. But 
what about social intelligence? Is this relevant 
to career assessment? Can social intelligence be 
measured? 

First identified by Thorndike (1920b), social 
intelligence refers to the capacity to understand 
other people and to relate effectively to them. Al- 
though there has been much ongoing controversy 
about the validity of the social intelligence con- 
struct, recent studies indicate that simple paper- 
and-pencil measures can be used to isolate this 
dimension of ability from other aspects of intelli- 
gence (Lowman & Leeman, 1988). We will review 
two studies to illustrate this point. 


Getter and Nowinski (1981) developed the 
Interpersonal Problem Solving Assessment Tech- 
nique (IPSAT), a semistructured free-response test 
of interpersonal effectiveness. In this test, the re- 
spondent is presented with a series of 46 problem- 
atic interpersonal situations and asked to imagine 
being in each situation. Examinees are instructed to 
write alternative ways of handling each situation 
and to indicate which of these potential solutions 
they would actually choose. An example of a situ- 
ation is as follows: 


Your boss (or teacher) has just criticized a piece of 
work that you’ve done, and you think the criticism 
is unjustified and unfair. What do you do? 


Based upon a detailed scoring manual, each re- 
sponse chosen by the examinee is scored in one of 
these categories: Effective, Avoidant, Inappropri- 
ate, Dependent, and unscorable. The grand total 
number of responses is first tallied to provide an 
index of the examinee’s ability to think of alterna- 
tive courses of action. Then, the number of chosen 
solutions scored in each category is counted, pro- 
viding a profile of the types of solutions preferred 
by the examinee. Interscorer reliability of IPSAT 
subscales is quite high, and correlations with other 
instruments strongly support the convergent and 
discriminant validity of this instrument. 
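
The two tallies described above (the grand total of alternatives generated, then the count of chosen solutions per category) can be sketched as follows. The data structure and the example protocol are hypothetical; assigning a response to a category relies on the detailed scoring manual and is not modeled here.

```python
from collections import Counter

CATEGORIES = ("Effective", "Avoidant", "Inappropriate", "Dependent", "Unscorable")

def summarize_ipsat(protocol):
    """protocol: one (alternatives_written, chosen_category) pair per situation."""
    total_alternatives = sum(n for n, _ in protocol)
    chosen = Counter(category for _, category in protocol)
    unknown = set(chosen) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # Report every category, including those never chosen.
    profile = {category: chosen.get(category, 0) for category in CATEGORIES}
    return total_alternatives, profile

# Invented protocol covering four situations only.
protocol = [(3, "Effective"), (2, "Avoidant"), (4, "Effective"), (1, "Dependent")]
total, profile = summarize_ipsat(protocol)
print(total, profile)
```

The first value indexes the ability to think of alternatives; the profile shows which types of solutions the examinee prefers.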

A more recent and promising inventory of social 
intelligence is the 128-item, true-false Social Rela- 
tions Survey (SRS) developed by Lorr, Youniss, and 
Stefic (1991). They used a rational scale construc- 
tion method buttressed with factor analysis to pro- 
duce an instrument that measures eight factors of 
social intelligence. The subscales and illustrative 
items are depicted in Table 12.5. For 49 subjects 
retested after two weeks, the median test-retest re- 
liability of the subscales is an impressive .89. Norms 
are provided for 260 college women and 355 high- 
school women. Several approaches to concurrent 
and construct validity indicate that the SRS provides 
a useful and valid approach to the self-report as- 
sessment of social skills. 
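
The (T) and (F) markers in Table 12.5 indicate the keyed direction of each item. The sketch below assumes the common convention of awarding one point whenever a response matches the key; that convention is an assumption here, not a rule taken from the SRS manual, and the item labels are shorthand for the two illustrative Social Assertiveness items.

```python
# Keyed directions copied from the illustrative items in Table 12.5
# (True = keyed "T", False = keyed "F"). Labels are shorthand, not official item IDs.
ASSERTIVENESS_KEY = {
    "easy_to_talk_with_people_just_met": True,
    "let_new_people_bring_up_topics": False,
}

def keyed_score(responses, key):
    """Count responses that match the keyed direction (assumed scoring rule)."""
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)

responses = {
    "easy_to_talk_with_people_just_met": True,   # matches the key
    "let_new_people_bring_up_topics": True,      # does not match the key
}
print(keyed_score(responses, ASSERTIVENESS_KEY))
```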

TABLE 12.5 Scales and Illustrative Items from the Social Relations Survey


Social Assertiveness
I find it easy to talk with other people that I have just met. (T)
When I meet new people I usually let them bring up things to talk about. (F)

Directiveness
I am at my best when I am the person in charge. (T)
I am comfortable letting others take the lead in a group. (F)

Defense of Rights
If someone breaks in line in front of me, I speak up. (T)
I am uncomfortable returning merchandise to a store. (F)

Confidence
I feel confident most of the time. (T)
I feel dissatisfied with my abilities. (F)

Perceived Approval
I am sure that people I know think well of me. (T)
I sometimes feel a sense of disapproval from others around me. (F)

Expression of Positive Feelings
I like to show my positive feelings for others. (T)
I am uncomfortable showing affection for a friend in public. (F)

Social Approval Need
I make a deliberate effort to make myself popular. (T)
I am unconcerned with what people say about me. (F)

Empathy
I am strongly affected when friends tell me about their problems. (T)
I usually maintain an objective and detached feeling toward others. (F)


Source: Based on Lorr, M., Youniss, R., & Stefic, E. (1991). An inventory of social skills. Journal of Personality Assessment, 57, 506-520.


Beyond a doubt, social intelligence is highly
relevant to career guidance. For example, a prospective
nurse will need high levels of social intelligence
to function effectively on the job. In contrast,
a computer technician may need little in the way of 
social skills to excel in the work environment. Low- 
man (1991) presents a hypothetical taxonomy of 
social intelligence to illustrate its relevance to ca- 
reer assessment (Table 12.6). Although the mea- 
surement of social intelligence remains a challenge, 
practitioners would be foolish to ignore this ability 
factor in career assessment. 


Personality Patterns in Career Assessment 


Several personality dimensions are also highly rel- 
evant to career assessment. Personality testing is 
discussed in detail in later chapters, so we will only 
mention a few occupationally relevant personality 
dimensions here: 


• Need for achievement is important for persons
with business and managerial aspirations (e.g.,
Orpen, 1983).

• Ascendance or dominance is also important
for success in managerial ranks (e.g., Bentz,
1985).

• Emotional stability predicts positive performance
in a wide range of traditional jobs,
whereas neuroticism is associated with success
in some artistic professions (e.g., Wills, 1984).

• Masculinity and femininity differ significantly
between various occupational groups (e.g.,
Gough, 1987).


Research on the relevance of personality dimen- 
sions to career assessment is still in its infancy. 
Nonetheless, preliminary trends such as those pre- 
viously noted clearly demonstrate the relevance of 
personality variables to job success. Practitioners 
are advised to consider occupationally relevant 
personality dimensions in career assessment. In 
sum, career assessment is a multifaceted enterprise
that must take into account not just interests
but also ability patterns and personality traits.


TABLE 12.6 Hypothetical Taxonomy of Social Demands of Jobs


Degree of Social
Involvement       Social Job Dimensions

Very high         Therapeutic, educational or management roles such as business
                  manager, nurse, or psychotherapist; high degree of social interaction

High              Social contact is not always primary, e.g., college professor, social
                  science researcher; moderate degree of social interaction

Moderate          Minimal social interaction, but social facilitation needed, e.g.,
                  high-level executive

Slight            Minimal social interaction and minimal need for concern with feelings
                  and reactions of others, e.g., clerk in discount department store

Low               Very limited interaction with people and no requirement for
                  therapeutic or influencing roles, e.g., laboratory scientist or novelist

Very low          Social skills not needed; the work setting may be unsociable, such as
                  computer programmer





Source: Based on Lowman, R. L. (1991). The clinical practice of career assessment: Interests, abilities, 
and personality. Washington, DC: American Psychological Association. 


SUMMARY 


1. A value is a shared, enduring belief about 
ideal modes of behavior or end states of existence. 
The assessment of values is important because val- 
ues instill action, shape attitudes, and guide efforts 
to influence others. 


2. Of historical interest, the Study of Values 
scale provided a rank ordering of six values: Theo- 
retical, Economic, Aesthetic, Social, Political, and 
Religious. Used in contemporary research, the 
Rokeach Value Survey defines and measures two 
kinds of values: instrumental (e.g., Ambitious, 
Imaginative) and terminal (e.g., a comfortable life, 
inner harmony). 


3. The purpose of interest inventories is to 
identify a person’s vocational and related interests 
in order to facilitate career choices. A good fit be- 
tween personal interests and the identified interest 
patterns of an occupation promotes success in, and 
satisfaction with, occupational choice. 


4. The Strong Interest Inventory (SII) is the 
latest revision of the Strong Vocational Interest 
Blank (SVIB), which first appeared in 1927. Like 
its predecessor, the SII uses empirical keys for 
occupations. 


5. Short-run stability coefficients for the 211 
occupational scales on the SII are generally in the 
.90s. The validity of the test is bolstered by the gen- 
erally good fit between the initial occupational pro- 
file and the occupation eventually pursued. 


6. The Jackson Vocational Interest Scale (JVIS) 
uses a forced-choice item format to reduce the im- 
pact of social desirability. The derivation of the 34 
basic interest scales was rational and theory guided. 
The scales are reasonably independent of one an- 
other and possess short-run stability coefficients in 
the mid-.80s. Validity studies are promising. 

7. The 168-item Kuder General Interest Sur- 
vey (KGIS), used with adolescents in grades 6 
through 12, produces 10 broad interest scores. 
Users must be cautious not to overinterpret the 
Kuder: the mean four-year stability coefficient for 
scales is only .50. 


8. The 160-item Vocational Preference In- 
ventory (VPI) is an objective, paper-and-pencil per- 
sonality interest inventory that assesses eleven 
dimensions, including the six personality—environ- 
ment themes of Realistic, Investigative, Artistic, 
Social, Enterprising, and Conventional (RIASEC). 


9. The Self-Directed Search (SDS) is a self-administered
and self-scored test of vocational interest.
The SDS is also based upon the RIASEC
model; each theme of this model characterizes not 
only a type of person but also the type of work en- 
vironment that such a person finds most compatible. 


10. The Campbell Interest and Skill Survey 
(CISS) consists of 200 interest items and 120 skill 
items that are rated upon a six-point scale. The test 
provides scores on seven Orientation Scales (Influ- 
encing, Organizing, Helping, Creating, Analyzing, 
Producing, and Adventuring), 29 Basic Scales, and 
58 Occupational Scales. Reliability is exception- 
ally strong and concurrent validity with similar 
tests is very robust. 


11. Work values refer to needs, motives, and 
values that influence vocational choice, job satis- 
faction, and career development. A proper match 
between work values and career choice is impor- 
tant for job satisfaction. 


12. A useful measure of work values is the 
Minnesota Importance Questionnaire (MIQ). One 
version of the test consists of paired-comparison 
items (e.g., Could give me a sense of accomplish- 
ment versus Could make my own decisions) that 
assess 20 needs organized into six underlying val- 
ues relevant to work satisfaction. 


13. Super’s 45-item Work Values Inventory 
(WVI) is a short instrument designed to measure 15 
work values in junior and senior high school stu- 
dents. Reliability and validity of the instrument are 
good. Like the MIQ, the WVI is interpreted in re- 
lation to occupational reinforcer patterns. 


14. The Values Scale (VS) is a cross-culturally 
derived instrument designed to assess 21 values rel- 
evant to work and life roles. The test consists of five 
items per value rated on a four-point scale. Primar- 
ily a research instrument, applications for individ- 
ual guidance should be approached cautiously. 


15. For some persons, career choice is not a 
discrete decision but a continuing process. Super 
outlines five stages of career development: growth, 
exploration, establishment, maintenance, and de- 
cline. In the growth and exploration stages, spe- 
cialized instruments are helpful for effective career 
assessment. 


16. Lowman has proposed an integrative model 
of career assessment that includes simultaneous 
consideration of interest, ability, and personality domains.
Ability patterns, including social intelligence,
can be very important in some vocations.


KEY TERMS AND CONCEPTS 


value p. 442
RIASEC model p. 451
work values p. 459
occupational reinforcer patterns p. 460
integrative model p. 463
social intelligence p. 464


TOPIC 12B Attitudes and the Assessment
of Moral and Spiritual Concepts


Attitudes and Their Assessment 
The Assessment of Moral Judgment 
The Assessment of Spiritual and Religious Concepts 


Summary 
Key Terms and Concepts 


The previous topic focused on interests and
values and the issues raised in their assess- 
ment. In this topic we continue the discussion of ad- 
ditional loosely defined but nonetheless useful 
constructs such as attitudes, moral values, and spir- 
itual concepts. We begin with attitudes because this 
concept is foundational, and the methods used in 
the assessment of attitudes can be widely applied. 
The topic then turns to assessment approaches in 
the moral, spiritual, and religious domains. This in- 
cludes lengthy coverage of Kohlberg’s (1981, 
1984) classic method for the measurement of moral 
reasoning. Finally, we close with brief coverage of 
the overlooked literature on the measurement of 
spiritual and religious concepts. 


ATTITUDES AND THEIR ASSESSMENT 


Throughout the history of psychology, the notion of 
attitude has played an essential role in the expla- 
nation of behavior. Gordon Allport (1935), an early 
pioneer in attitude research, characterized the con- 
cept of attitude as “distinctive” and “indispensable” 
to social psychology. The importance of attitude as 
a psychological construct has not diminished in recent
years. For example, a search of PsycINFO
with the keyword “attitude” revealed more than 
12,000 articles published from 1992 to 2002. 
Attitudes are closely linked to related concepts 
such as values, opinions, and beliefs, so it is important
to distinguish these terms (Aiken, 2002). As
discussed earlier, a value is a shared and enduring 
idea about what is ideal; that is, a value refers to 
what is ultimate or best. In contrast, an opinion can 
be regarded as the overt, conscious, verbal demon- 
stration of an attitude. Aiken (2002) notes that 
opinions are “less central, more specific, more 
changeable, and more factually based” than atti- 
tudes. Also, opinions are expressed in words, 
whereas attitudes may not be. Finally, a belief is a 
conviction that something is true, even though it 
cannot be rigorously proved. Religious beliefs cer- 
tainly fall into this category. Thus, a belief is some- 
where between knowledge and attitude. 

Having said what attitudes are not, we now turn 
to a positive definition: 


Attitudes may be viewed as learned cognitive, 
affective, and behavioral predispositions to respond 
positively or negatively to certain objects, situa- 
tions, institutions, concepts, or persons. Attitudes 
may be quite individual and thereby reflective of 
and related to personality characteristics such as a 
need for closure. A need for closure is expressed as 
a desire to complete a task, as in finding an answer 
to a question or a solution to a problem. (p. 3) 


The central feature of attitudes is that they always
have an evaluative component: attitudes
involve positive or negative responses of
some kind. But it is also clear that the expression of
attitudes can be multifactorial (i.e., cognitive, affective,
or behavioral). Also, an attitude always has an
object—it is in reference to something. Examples 
of attitudinal objects include the death penalty, 
handguns, Republicans, former president William 
Clinton, cold weather, telemarketers, slow drivers, 
and a late start to the school year. Finally, attitudes 
serve motivational functions by helping individu- 
als organize their perceptions and make sense out of 
the world. 


Assessment of Attitudes 


Obviously, an attitude is an unobservable, hypo- 
thetical construct. As such, it must be inferred from 
measurable responses indicating positive or nega- 
tive evaluations of the attitudinal object. Attitudes 
may be inferred from cognitive responses (e.g., 
knowledge of the attitudinal object), behavioral re- 
sponses (e.g., intentions or actions with respect to 
the object), and affective responses (expressed feel- 
ings toward the object). Even though attitudes can 
be inferred from all three sources, psychologists 
generally regard the affective response as the most 
essential aspect of attitudes (Ajzen, 1996). 

Attitude measurement has followed a different 
path than assessment in other areas such as per- 
sonality or intelligence. In these other areas, re- 
searchers typically try to develop a small number of 
definitive instruments that will become widely 
adopted in the field. In contrast, the typical ap- 
proach in attitude assessment is for most re- 
searchers to develop their own unique instruments. 
This is because attitudes come in a virtually infi- 
nite supply, depending upon the attitudinal object 
of interest to researchers. 


Approaches to Attitude Assessment 


There are three broad approaches to the measure- 
ment of attitudes: behavioral, covert, and question- 
naire. We discuss the behavioral and covert 
approaches briefly before reviewing the mainstay 
of attitude assessment—questionnaire approaches. 

The behavioral approach involves the direct 
measurement of intentions or actions in regard to 
the attitudinal object. For example, if a door-to-door
canvasser asks for donations to the Jennifer
Jones for Mayor fund, the willingness to donate (as 
well as the amount) would be an index of attitudes 
toward Ms. Jones as a mayoral candidate. Other ex- 
amples of the behavioral approach to attitude as- 
sessment would be asking people to sign a petition 
or to circulate flyers regarding a certain cause (e.g., 
building a new municipal swimming pool). De- 
clining to sign the petition or to circulate flyers 
would indicate a negative attitude toward the cause, 
whereas willingness to do either would signify a 
positive attitude. 

Covert approaches to attitude assessment in- 
volve unobtrusive procedures and measurements. 
For example, in the lost-letter technique (Schwartz 
& Ames, 1977), a researcher would prepare hun- 
dreds of envelopes, each with a new stamp, osten- 
sibly addressed to different organizations. In 
reality, it is only the name of the agency that differs 
among the envelopes (e.g., some addressed to 
“Campaign to End Capital Punishment,” others addressed
to “Campaign to Promote Capital Punishment”),
whereas the street address is actually that
of the researcher. These letters are then surrepti- 
tiously dropped on busy sidewalks throughout the 
city. On the assumption that individuals will rescue 
(and mail) the letters that appear to support their 
views (and may discard the others), the relative return
rates for the two kinds of letters are then an
index of city-wide attitudes toward the concept of
interest, for example, capital punishment. 
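
The index described above is simply a comparison of two proportions. In the sketch below, all counts are invented for illustration; the function name and the choice of a simple difference as the summary index are assumptions, not part of the published technique.

```python
def return_rate(returned, dropped):
    """Proportion of dropped letters that were mailed back to the researcher."""
    if dropped <= 0 or not 0 <= returned <= dropped:
        raise ValueError("need 0 <= returned <= dropped and dropped > 0")
    return returned / dropped

# Invented counts: 200 letters dropped for each version of the addressee.
rate_end = return_rate(90, 200)       # "Campaign to End Capital Punishment"
rate_promote = return_rate(60, 200)   # "Campaign to Promote Capital Punishment"

# A positive difference would suggest more support for ending capital punishment.
difference = rate_end - rate_promote
print(f"{rate_end:.2f} vs {rate_promote:.2f}, difference {difference:+.2f}")
```

Because the two versions differ only in the addressee name, any reliable difference in return rates is attributed to attitudes toward the named cause.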

The implicit association test (IAT) is another 
example of a covert measure of attitudes. In an im- 
plicit association test, the researcher uses reaction 
time to measure the automatic or “unconscious” as- 
sociations of individuals to different target con- 
cepts. Greenwald, McGhee, and Schwartz (1998) 
explain the rationale for this approach by contrast- 
ing a hypothetical experiment with their real study. 
In the hypothetical experiment, the examinee is 
asked to view a series of male and female faces, 
saying “hello” if the face is male and “goodbye” if 
it is female. Of course, the responses are timed. For
a second task, the participant says “hello” for male 
names and “goodbye” for female names. Finally, 
the two tasks are combined with the four kinds of 
stimuli presented in a random manner. Of course,
this would be an easy task, and response times 
would be quite fast. Greenwald et al. (1998) then 
explain the rationale for their real study as follows: 


One might appreciate the IAT’s potential value as a 
measure of socially significant automatic associa- 
tions by changing the thought experiment to one in 
which the to-be-distinguished faces of the first task 
are Black or White (e.g., “hello” to African Ameri- 
can faces and “goodbye” to European American 
faces) and the second task is to classify words as 
pleasant or unpleasant in meaning (“hello” to 
pleasant words, “goodbye” to unpleasant words). 
The two possible combinations of these tasks can 
be abbreviated as Black + pleasant and White + 
pleasant. Black + pleasant should be easier [faster 
reaction times] than White + pleasant if there is a 
stronger association between Black Americans and 
pleasant meaning than between White Americans 
and pleasant meaning. If the preexisting associa- 
tions are opposite in direction—which might be 
expected for White subjects raised in a culture im- 
bued with pervasive residues of a history of anti- 
Black discrimination—the subject should find 
White + pleasant to be easier. (p. 1466) 


In an actual IAT study, respondents press computer 
keys rather than verbalizing their responses, per- 
mitting accurate timing to a millisecond. The ad- 
vantage of the IAT approach is that it presumably 
gets around the social desirability bias encountered 
with paper-and-pencil measures. The procedure is 
designed to reveal attitudes, even when participants
prefer not to express these attitudes. Currently,
the IAT approach is used mainly for research
to test theories in social psychology. 
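
The timing logic behind the IAT can be sketched in a few lines of Python. This is only an illustration: the function name and the latencies (in milliseconds) are hypothetical, and a simple difference in mean response times is a simplification of the scoring procedures used in published research.

```python
from statistics import mean

def iat_latency_difference(block_a_ms, block_b_ms):
    """Return mean latency of block B minus mean latency of block A.

    A positive value means block A was answered faster (was "easier").
    """
    return mean(block_b_ms) - mean(block_a_ms)

# Hypothetical latencies (ms) for one respondent; not real data.
white_pleasant = [612, 580, 645, 598, 630]
black_pleasant = [701, 688, 742, 715, 694]

diff = iat_latency_difference(white_pleasant, black_pleasant)
print(f"Mean latency difference: {diff:.1f} ms")
```

With these made-up numbers the respondent is faster on the White + pleasant pairing, which under the rationale quoted above would be read as a stronger preexisting association for that pairing.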

Psychophysiological measurements also have 
been used to assess attitudes. For example, pupil- 
lary response can be measured by unobtrusive cam- 
eras aimed at the pupil of the viewer’s eye as he or 
she looks at different pictures. This is the science 
of pupillometrics (measurement of pupil size). 
When other factors are held constant (e.g., back- 
ground light), a larger pupil is presumed to indicate 
a greater interest in the observed picture (Hess & 
Polt, 1960). 

Pupillometrics reached its high point with the 
publication of The Tell-Tale Eye: How Your Eyes 
Reveal Hidden Thoughts and Emotions (Hess,
1975). In this book, Hess recounts an intriguing application
of pupillometrics to advertising. A large
number of observers looked at two different adver- 
tisements for Encyclopaedia Britannica. One was 
a new ad showing boys in a pool, the other a stan- 
dard ad depicting a wholesome family scene. Based 
on a questionnaire, the observers expressed a pref- 
erence for the new ad (a more favorable attitude). 
However, their pupils did not dilate at all for the 
new ad, whereas they dilated significantly for the 
standard ad. The two ads were placed in different 
copies of a magazine, together with a coded reply 
card. The two versions of the magazine sold ap- 
proximately the same number of copies. However, 
the return rate for reply cards sent with the standard 
ad was far higher than for the new ad. Thus, pupil- 
lometrics predicted apparent attitudes much better 
than a traditional questionnaire technique. Even 
though this experiment was plagued with method- 
ological weaknesses, the findings did serve to pop- 
ularize the use of pupillometrics as an attitudinal 
measure. However, these techniques are expensive 
and therefore inefficient when the goal is to assess 
attitudes in a large group of individuals. Another 
concern is that pupil enlargement may signify not 
just a positive attitude, but also arousal or novelty 
of the stimulus picture. 


Questionnaires in Attitude Assessment 


The vast majority of attitude measures are question- 
naires based upon established scaling methods. The 
reader will recall from Topic 4B (Test Construction) 
that a variety of scaling methods are available, in- 
cluding the method of equal-appearing intervals, the 
method of absolute scaling, the Likert scale, and the 
Guttman scale. Without a doubt, the Likert scale is 
the most popular in attitude measurement. In this ap- 
proach, the examinee is offered five (or sometimes 
seven) responses ordered on an agree/disagree con- 
tinuum. For example, one item on a scale to assess 
attitudes toward death might read: 


It makes me anxious when people talk about death. 
Do you: 


□          □       □           □          □
Strongly   Agree   Undecided   Disagree   Strongly
Agree                                     Disagree
In a Likert scale, the total score is obtained by adding
up the scores (1 to 5) for individual items. Of course,
scoring is reversed for negatively phrased items.
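
The scoring rule just described can be sketched as follows. The function name, the item numbering, and the sample answers are hypothetical, and the sketch assumes that each response has already been coded from 1 to 5 so that a higher number reflects more of the attitude being measured.

```python
def likert_total(responses, reverse_items=(), points=5):
    """Sum Likert item scores, reversing negatively phrased items.

    responses: dict mapping item number -> raw score (1..points)
    reverse_items: item numbers whose scoring is reversed
    """
    total = 0
    for item, score in responses.items():
        if not 1 <= score <= points:
            raise ValueError(f"item {item}: score {score} outside 1..{points}")
        # On a 1..points scale, the reversed score is (points + 1) - score.
        total += (points + 1 - score) if item in reverse_items else score
    return total

# Hypothetical four-item scale in which item 3 is negatively phrased.
answers = {1: 4, 2: 5, 3: 2, 4: 3}
print(likert_total(answers, reverse_items={3}))
```

The same function handles a seven-point format by passing points=7.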

By definition, an attitude measure is supposed
to tap a highly homogeneous construct, especially
insofar as the affective response (extent of positive 
or negative feelings about the attitudinal object) is 
central to attitudes. For this reason, the most im- 
portant psychometric quality of an attitude measure 
is that it should possess strong internal consistency 
as measured by coefficient alpha or a related index. 
In regard to validity of attitude measures, an im- 
portant point is that attitudes are highly robust 
human characteristics. Thus, attitude scales mea- 
suring similar constructs should correlate very 
highly, even when the scales are developed by dif- 
ferent researchers (Davis & Ostrum, 1996). 
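
For readers who want to see how coefficient alpha is obtained, a minimal sketch follows. It uses the standard formula, alpha = [k / (k - 1)] × [1 - (sum of item variances / variance of total scores)], where k is the number of items. The function name and the small data set are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import variance

def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of respondents' item-score lists."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of respondents' total scores.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 respondents x 3 items, not real data.
data = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 3],
]
print(round(coefficient_alpha(data), 3))
```

Alpha rises toward 1.0 as the items covary more strongly, which is why it serves as an index of the homogeneity expected of an attitude measure.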

The characteristics of a good attitude-related 
measure can be illustrated with a specific instru- 
ment, the Gratitude Questionnaire-Six Item Form 
(GQ-6; McCullough, Emmons, & Tsang, 2002). 
The GQ-6 is a simple self-report measure of the 
disposition to experience gratitude (Figure 12.4). 
Strictly speaking, the GQ-6 is a trait measure of the 
grateful disposition. However, this trait is affective
in nature; therefore, the instrument fittingly illustrates
the concepts and issues involved in developing
an attitude measure.

The reader will notice that the GQ-6 is based on 
a Likert-type format with seven alternatives ranging 
from 1 (strongly disagree) to 7 (strongly agree). 
Two items are stated in the reverse (and therefore 
reverse scored) as a way of inhibiting response bias. 
The development and choice of specific test items 
was based on a thorough analysis of the many facets 
of the grateful disposition (McCullough et al., 
2002). The authors determined that gratitude re- 
flects intensity (feeling more intensely grateful), 
frequency (feeling grateful many times a day), span 
(grateful for many things), and density (grateful to 
many individuals). Initially, they proposed 39 items 
to measure these qualities. The GQ-6 is composed 
of the six best items, as determined by factor- 
analytic procedures performed with test results from 
two samples: 238 undergraduates and 1,228 adult 
volunteers surveyed via the Internet. Reliability of 
the instrument is good, with coefficient alphas between
.82 and .87. Validity of the GQ-6 is based
upon numerous theory-confirming relationships
with other measures. For example, self-ratings on
the GQ-6 correlated modestly with external observers’
perceptions of gratitude in the participants.
Additional substudies indicated that the GQ-6 is
positively related to optimism, hope, spirituality and
religiousness, forgiveness, empathy, and prosocial
behavior. The scale is negatively related to depression,
anxiety, materialism, and envy.


Using the scale below as a guide, write a number beside
each statement to indicate how much you agree with it.

1 = strongly disagree    5 = slightly agree
2 = disagree             6 = agree
3 = slightly disagree    7 = strongly agree
4 = neutral

___ 1. I have so much in life to be thankful for.
___ 2. If I had to list everything that I felt grateful for,
       it would be a very long list.
___ 3. When I look at the world, I don’t see much to
       be grateful for.*
___ 4. I am grateful to a wide variety of people.
___ 5. As I get older I find myself more able to appreciate
       the people, events, and situations that
       have been part of my life history.
___ 6. Long amounts of time can go by before I feel
       grateful to something or someone.*

*Items 3 and 6 are reverse scored.

FIGURE 12.4 The Gratitude Questionnaire-Six Item Form (GQ-6)
Source: Reprinted with permission of Michael McCullough and
Robert Emmons. Copyright © 2002, all rights reserved.

Literally thousands of attitude measures have 
been proposed. Aiken (2002) provides information 
on dozens of carefully validated instruments. An In- 
ternet search using the phrase “Attitude Measures” 
revealed 647,000 sources, many citing unpublished 
instruments. Table 12.7 lists a sampling of the atti- 
tudinal objects surveyed by some of these measures. 


TABLE 12.7 Attitudinal Objects Surveyed by Unpublished Scales

Attitudes Toward...

Alcohol use among students       Homeless people
Body parts                       Insurance
Childhood illness                Late work arrival
Christianity                     Librarians
Contraception                    Organ donation
Electroconvulsive therapy        Overweight people
Engineering as a vocation        Physician-assisted suicide
Gay/lesbian parenting            Psychotherapy for self
Herpes                           School bullying
HIV infection                    School subjects


Issues in Attitude Assessment


One of the major issues in attitude assessment is
whether attitudes predict behavior. The literature on
this topic is vast, and the findings are complex and
multifaceted. In an early, classic study, LaPiere 
(1934) established that motel and restaurant owners 
in the United States answered attitude question- 
naires one way, but behaved in another. Specifically, 
when these individuals were asked questions such 
as, “Will you accept persons of the Chinese race as 
guests,” they answered no, but when Chinese patrons
actually visited their establishments, they
were almost always served. Many other studies also point
to a weak link, at best, between attitude measures 
and behavior (Aiken, 2002). 

More recently, researchers have focused on ways 
to increase the predictive validity of attitude mea- 
sures. One general theme of this research is that atti- 
tudes will be predictive if they are strongly activated 
(Greenwald & Banaji, 1995). Another finding is that 
attitudes will be predictive if the actor is highly con- 
scious of them (Myers, 2002). A recent review of at- 
titude research can be found in Ajzen (2001). 






THE ASSESSMENT OF 
MORAL JUDGMENT


The Moral Judgment Scale 


Kohlberg has proposed one of the few theories of 
moral development that is both comprehensive and 
empirically based (Colby, Kohlberg, Gibbs, &
Lieberman, 1983; Kohlberg, 1958, 1981, 1984;
Kohlberg & Kramer, 1969). Although he was more 
concerned with theory-based problems of moral de- 
velopment than with the nuances of standardized 
measurement, Kohlberg did generate a method of as- 
sessment that is widely used and intensely debated. 
We will review the underlying rationale for his mea- 
surement tool and discuss the psychometric proper- 
ties of the instrument as well. In addition, we will 
take a brief look at a more objectively based adapta- 
tion of Kohlberg’s approach known as the Defining 
Issues Test (Rest, 1979; Rest & Thoma, 1985).


Stages of Moral Development 


Kohlberg’s theory grew out of Piaget’s (1932) stage 
theory of moral development in childhood. 
Kohlberg extended the stages into adolescence and 
adulthood. In order to explore reasoning about dif- 
ficult moral issues, he devised a series of moral 
dilemmas. One of the most famous is the dilemma 
of Heinz and the druggist: 


In Europe, a woman was near death from a special 
kind of cancer. There was one drug that the doctors 
thought might save her. It was a form of radium that 
a druggist in the same town had recently discov- 
ered. The drug was expensive to make, but the drug- 
gist was charging ten times what the drug cost him 
to make. He paid $200 for the radium and charged 
$2000 for a small dose of the drug. The sick 
woman’s husband, Heinz, went to everyone he 
knew to borrow the money, but he could only get 
together about $1000 which is half of what it cost. 
He told the druggist that his wife was dying, and 
asked him to sell it cheaper or let him pay later. But 
the druggist said, “No, I discovered the drug and 
I’m going to make money from it.” So Heinz got 
desperate and broke into the man’s store to steal the 
drug for his wife. (Kohlberg & Elfenbein, 1975) 


After reading or hearing this story, the respondent is 
asked a series of probing questions. The questions 
might be as follows: Should Heinz have stolen the 
drug? What if Heinz didn’t love his wife? Would 
that change anything? What if the person dying was 
a stranger? Should Heinz steal the drug anyway? 
Based on answers to this and other dilemmas, 
Kohlberg concluded that there are three main levels 
of moral reasoning, with two substages within each
level (Table 12.8). One use of his measurement instrument,
the Moral Judgment Scale, is to determine
a respondent’s stage of moral reasoning.¹

The Moral Judgment Scale consists of several 
hypothetical dilemmas such as Heinz and the drug- 
gist, presented one at a time (Colby, Kohlberg, 
Gibbs, & others, 1978). In its latest revision, the 
Scale comes in three versions called Forms A, B, 
and C. Scoring is quite complex, based on the 
examiner’s judgment of responses in relation to 
extensive criteria outlined in a detailed scoring 
manual (Colby & Kohlberg, 1987). Although there 
are several different dimensions to scoring, the one 
element most frequently cited in research studies is 
the overall stage of moral reasoning that character- 
izes a respondent. 


Critique of the Moral Judgment Scale 


Early versions of the Moral Judgment Scale suf- 
fered serious shortcomings of scoring and interpre- 
tation. For example, in his doctoral dissertation, 
Kohlberg (1958) proposed two scoring systems: one 
using the sentence or completed thought as the unit 
of scoring, the other relying upon a global rating of 
all the subject’s utterances as the unit of analysis. 
Neither approach was fully satisfactory, and early 
reviews of the scale were justifiably critical of its re- 
liability and validity (Kurtines & Greif, 1974). 

1. Even though the Moral Judgment Scale has been widely used
for empirical research, Kohlberg (1981, 1984) suggests that its
most valuable application is for the promotion of self-understanding
and the development of moral reasoning in the individual
respondent.


TABLE 12.8 Kohlberg’s Levels and Stages of Moral Development

Level 1: Preconventional

  Stage 1. Punishment and obedience orientation: The
           physical consequences determine what is
           good or bad.
  Stage 2. Instrumental relativism orientation: What
           satisfies one’s own needs is good.

Level 2: Conventional

  Stage 3. Interpersonal concordance orientation: What
           pleases or helps others is good.
  Stage 4. “Law-and-order” orientation: Maintaining the
           social order and doing one’s duty is good.

Level 3: Postconventional or Principled

  Stage 5. Social contract-legalistic orientation: Values
           agreed upon by society determine what is
           good.
  Stage 6. Universal ethical-principle orientation: What
           is right is a matter of conscience
           derived from universal principles.

Source: Based on Kohlberg (1984).


In response to these criticisms, Kohlberg and
his associates developed a scoring system that is
unparalleled in its clarity, detail, and sophistication
(Rest, 1986). Fortuitously, since the moral dilemmas
of the Moral Judgment Scale have remained
constant over the years, it is possible to apply the
new scoring system to old data. The capacity to reanalyze
old data and compare it with new data is invaluable
in determining the reliability and validity
of an existing scale. A most important study in this
regard has been published by Kohlberg and associates
(Colby et al., 1983).

This investigation reports the results of using 
the new scoring system in a longitudinal study 
spanning more than 20 years. The results are im- 
pressive and offer strong support for the reliability 
and validity of the instrument. Test-retest correla- 
tions for the three forms were in the high .90s, as 
were interrater correlations. Longitudinal scores of 
subjects tested at three- to four-year intervals over 
20 years revealed theory-consistent trends. Fifty- 
six of 58 subjects showed upward change, with no 
subjects skipping any stages. Furthermore, only 6 
percent of the 195 comparisons showed backward 
shifts between two testing sessions. The internal 
consistency of scores was also excellent: about 70 
percent of the scores were at one stage, and only 2 
percent of the scores were spread further than two 
adjacent stages. Cronbach’s alpha was in the mid- 
.90s for the three forms. These findings have been 
corroborated by Nisan and Kohlberg (1982). Heil- 
brun and Georges (1990) also report favorably 
upon the validity of the Moral Judgment Scale, insofar
as postconventional development is correlated
with higher levels of self-control, as would be pre- 
dicted from the fact that morally mature persons 
often oppose social pressure or legal constraints. In 
sum, the Moral Judgment Scale is reliable, inter- 
nally consistent, and possesses a theory-confirming 
developmental coherence. 


The Defining Issues Test 


The Defining Issues Test (DIT) is similar to the 
Moral Judgment Scale, but incorporates a much 
simpler and completely objective scoring format 
(Rest, 1979, 1986). The examinee reads a series of 
moral dilemmas similar to those designed by 
Kohlberg, and then chooses a proper action for 
each. For example, one dilemma involves a patient 
dying a painful death from cancer. In her lucid mo- 
ments, she requests an overdose of morphine to 
hasten her death. What should the doctor do? Three 
options of the following kind are listed: 


_______ He should give the woman a fatal overdose
_______ Should not give the overdose
_______ Can’t decide

The examinee’s choice does not enter directly into 
the determination of the moral judgment score. The 
real purpose in forcing a choice is to cause the ex- 
aminee to think about the importance of various 
factors in making the decision. Following the 
choice of proper action, the examinee rates the im- 
portance of several factors on a five-point Likert 
scale: great, much, some, little, or no importance. 
The factors are distinct for each dilemma. The fac- 
tors differ in the level of moral judgment they sig- 
nify, ranging from Kohlberg’s stage 1 through stage 
6. In the case of the preceding dilemma, the factors 
include such matters as follows: 


_______ Whether the doctor can make it look like
        an accident
_______ Can society afford to let people end their
        lives when they want to
_______ Whether the woman’s family favors giving
        the overdose or not








These ratings form the basis for generating sev- 
eral quantitative scores that pertain to the moral 
judgment of the examinee. The most widely used 
score is the P score, which is a percentage of princi- 
pled thinking. Reliability of the P score ranges from 
.71 to .82 in test-retest studies (Rest, 1979, 1986). 
Validity has been studied by contrasting groups 
known to differ on principled thinking. For exam- 
ple, graduate students in moral philosophy and po- 
litical science, general college students, high school 
seniors, and ninth-grade students were found to dif- 
fer appropriately and systematically on the P score. 
In longitudinal studies, significant upward trends 
were found over six years and four testings. Re- 
cently, Rest has recommended a new measure of 
moral judgment, the N2 index, calculated on the 
basis of several complex formulas that use both 
ranking and rating data. The two indices are highly 
correlated in the .90s. Nonetheless, in a retrospec- 
tive analysis of previous studies, the N2 index out- 
performed the P index by a substantial margin 
(Rest, Thoma, Narvaez, & Bebeau, 1997). 
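
The idea of a "percentage of principled thinking" can be illustrated with a short sketch. The scheme below is an assumption made for illustration and is not taken from the test manual: the four items ranked most important in each dilemma are given 4, 3, 2, and 1 points, and the score is the share of those points that falls on stage 5 and stage 6 items. The function name, the weights, and the sample rankings are hypothetical; the published scoring rules should be consulted for actual use.

```python
RANK_WEIGHTS = (4, 3, 2, 1)  # assumed points for 1st..4th ranked item

def p_score(ranked_stages_per_dilemma, principled_stages=(5, 6)):
    """Percentage of ranking points assigned to principled-stage items.

    ranked_stages_per_dilemma: for each dilemma, the stages of the items
    ranked most important, in order (up to four per dilemma).
    """
    principled = 0
    possible = 0
    for ranked in ranked_stages_per_dilemma:
        for weight, stage in zip(RANK_WEIGHTS, ranked):
            possible += weight
            if stage in principled_stages:
                principled += weight
    if possible == 0:
        raise ValueError("no ranked items supplied")
    return 100.0 * principled / possible

# Hypothetical rankings for three dilemmas (stage of each ranked item).
rankings = [
    [5, 4, 3, 6],
    [4, 5, 2, 3],
    [6, 5, 4, 4],
]
print(p_score(rankings))
```

Under these assumptions, an examinee who gives top rankings only to stage 5 and 6 considerations would score 100, and one who never ranks such items would score 0.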

Over 600 articles have been published on the 
Defining Issues Test (McCrae, 1985; Moreland, 
1985; Sutton, 1992). In general, the instrument is 
considered a useful alternative to Kohlberg’s Moral 
Judgment Scale, particularly for research on group 
differences in moral reasoning. However, reviewers 
do note several cautions about the DIT (Sutton, 
1992; Westbrook & Bane, 1992). First, the test uses 
two moral dilemmas from the Vietnam War and is 
therefore somewhat dated. Many young examinees 
have little knowledge of (and perhaps no interest in) 
this topic and may find it difficult to identify with 
these questions. Another dilemma—the classic 
case of whether Heinz should steal a drug to save 
his wife’s life—is also of dubious value since it has 
been widely publicized and reprinted in college 
textbooks. A significant proportion of prospective 
examinees are no longer naive about this moral 
dilemma. 

Richards and Davison (1992) have pressed the 
point that the DIT is biased against conservatively 
religious individuals. Certainly, it is well estab- 
lished that conservative or fundamentalist religious 
people tend to score lower than average on the P 
score of the Defining Issues Test (Getz, 1984;
Richards, 1991). According to Richards and Davi- 
son (1992), the reason for this is that stage 3 and 
stage 4 items (unintentionally) possess strong the- 
ological implications that cause fundamentalist in- 
dividuals to endorse the items, thereby lowering 
their score on the test. Consider items that tap stage 
4 reasoning, which is the “law and order” orienta- 
tion that equates “moral” with doing one’s duty and 
maintaining the social order. Whereas nonreligious 
persons might support the laws of the land (and en- 
dorse stage 4 items) because they believe that legal 
authorities define what is right and moral, religious 
minorities such as Mormons believe that sup- 
porting the laws of the land is a theological and 
religious obligation that flows directly from articles 
of faith in their religion:


While Mormons place a high value on obeying the
law and supporting legal authorities, this value is 
due to their theological belief that God has com- 
manded them to do so, and not because they be- 
lieve, as do true Stage 4 thinkers, that the laws of 
the land or legal authorities define what is right or 
moral. (Richards & Davison, 1992, p. 470)


These researchers demonstrate empirically that cer- 
tain DIT items measure a different construct for 
conservative religious persons than for the general 
population. As a consequence, the validity of the 
test in these groups is open to question. 

A related criticism of the DIT is the dearth of 
norms pertinent to minority groups. Finally, West- 
brook and Bane (1992) argue that the technical 
manual for the DIT lacks essential details needed 
to evaluate the adequacy of the test. In spite of these 
criticisms, the DIT is a widely respected test, par- 
ticularly for research on moral reasoning. 


THE ASSESSMENT OF SPIRITUAL 
AND RELIGIOUS CONCEPTS 


Within the field of psychology, transcendent topics 
such as spiritual well-being or faith maturity never 
have received mainstream attention. Fifty years 
ago, Gordon Allport (1950) lamented that the sub- 
ject of religion “seems to have gone into hiding” 
among intellectuals and academic researchers: 


Whatever the reason may be, the persistence of re- 
ligion in the modern world appears as an embar- 
rassment to the scholars of today. Even 
psychologists, to whom presumably nothing of 
human concern is alien, are likely to retire into 
themselves when the subject is broached. (p. 1) 


The situation is little improved in contemporary 
times. For example, except for a few specialty jour- 
nals, spiritual and religious topics are virtually ab- 
sent from the psychological literature. 

Yet researchers have no right to retire from the 
field, given its significance to the average person. 
Consider these statistics on religious belief in the
United States, stable since 1944 when national 
polls first came into use (Hoge, 1996): 


• Belief in God has remained constant at about 95
  percent of the population.

• Belief in the divinity of Jesus Christ has been endorsed
  by 75 to 77 percent of adults.

• Belief in an afterlife has remained at about 75
  percent of the population.


Comparable statistics are not available worldwide, 
but it seems likely that the percentage of believing 
individuals (whether Muslim, Buddhist, Hindu, 
Jew, or other) is very high. Most people embrace a 
spiritual perspective in life, and surely this must 
have some bearing on their adjustment, behavior, 
and outlook. 

Unfortunately, the field of psychology, includ- 
ing the specialty area of testing, largely has main- 
tained an indifference to this important aspect of 
human experience. Worse yet, in many intellectual 
circles the endorsement of spiritual or religious 
sentiments is seen as evidence of psychopathology. 
Among others, Sigmund Freud endorsed a cynical 
view of religion in his aptly titled essay, The Future 
of an Illusion (1927/1961). Yet for many persons, a 
connection with the transcendent is essential to 
meaning in life. This is especially so in times of ex- 
treme duress, as when personal annihilation knocks 
at the front door. Consider the experience of Viktor 
Frankl (1963), a Nazi death camp survivor and 
founding figure of existential psychology. At one 
point during World War II he had to surrender his 
coat with a cherished manuscript in the pockets in 
exchange for the worn-out rags of an inmate sent to
the gas chamber: 


Instead of the many pages of my manuscript, I 
found in a pocket of the newly acquired coat a 
single page torn out of a Hebrew prayer book, 
which contained the main Jewish prayer, Shema 
Yisrael. How should I have interpreted such a 
“coincidence” other than as a challenge to live 
my thoughts instead of merely putting them on
paper?


In the remainder of this topic, we take the view that 
spiritual and religious dimensions to life often 
serve constructive purposes and that assessment 
within these domains is worthy of additional study. 


The Rationale for Religious 
and Spiritual Assessment 


Academic researchers justify the assessment of re- 
ligious and spiritual dimensions as a means of pur- 
suing the truth about human behavior. Their 
primary motive is intellectual curiosity and their 
goal is to understand the role of religion and spiri- 
tuality in human affairs. But is there any practical 
reason for seeking to measure religious or spiritual 
dimensions in the individual? For example, would 
clinical practitioners such as psychologists or so- 
cial workers gain anything useful by assessing the 
religious and spiritual backgrounds of their clients? 

Richards and Bergin (1997) offer several com- 
pelling arguments in favor of therapists’ assessment 
of the religious and spiritual backgrounds of their 
clients. Specifically, this form of assessment could 
serve to 


1. Increase empathy by helping therapists to better 
understand the worldview of their clients 

2. Identify and assess the impact of healthy and un- 
healthy religious-spiritual orientations in clients 

3. Determine whether religious and spiritual beliefs 
and community can provide support to clients 

4. Identify possible spiritual interventions that can 
be used in therapy to help clients 

5. Determine whether clients possess unresolved 
spiritual doubts or concerns that need to be 
addressed 


In general, Richards and Bergin (1997) argue that 
clinicians need to address the whole person in order 
to provide the best possible response to emotional 
and psychological problems. Because most clients 
bring religious and spiritual concerns to the ther- 
apy session, clinicians who assess these dimensions 
of human existence may be better prepared to pro- 
vide effective service. 


Historical Overview on Religious Assessment 


Interest in the psychology of religion can be traced 
to the early 1900s when William James (1902) 
composed his masterpiece, The Varieties of Reli- 
gious Experience. In this book, James catalogued 
the manifold ways in which humans reveal their in- 
terest in transcendent matters. His overall conclu- 
sion was that religion is “an essential organ of our 
life, performing a function which no other portion 
of our nature can so successfully fulfill.” 

Although many writers have offered psycho- 
logical analyses of religion since the seminal writ- 
ings of James, it was not until the 1960s that scales 
for the assessment of religious variables began to 
appear (Wulff, 1996). One of the first such mea- 
sures was the Allport-Ross Religious Orientation 
scales, which proposed to assess two dimensions of 
religious expression, the intrinsic and the extrinsic 
(Allport & Ross, 1967). Intrinsically religious per- 
sons were thought to live their religion (e.g., to find 
meaning, direction, outlook), whereas extrinsically 
religious persons were believed to use their religion 
(e.g., to seek security, status, sociability). In his ear- 
lier writings on this topic, Allport referred to in- 
trinsic religious expression as a genuine or mature 
religious orientation, whereas extrinsic religious 
expression was viewed as immature. Later he 
dropped the mature-immature designations be- 
cause the labels seemed overly judgmental. 

The impetus for development of these scales 
was Allport’s distressing observation of a positive 
relationship between religiosity (in certain forms) 
and authoritarian, bigoted, prejudicial attitudes. As 
a devoutly religious person, Allport was convinced 
that intrinsically oriented religious individuals 
rarely would harbor these attitudes. After all, an essential
precept of almost every religious faith is an
attitude of love toward one’s neighbors. In the 
Christian faith, this view is summed up in the fa- 
mous dictum “Love your neighbor as yourself” 
(Mark 12:31). Yet the evidence was overwhelming 
to Allport that at least some religious individuals 
did reveal hatred, bigotry, and prejudice toward 
their neighbors. The usual targets of these mali- 
cious attitudes were racial minorities, Jews, and ho- 
mosexual persons, among others. He reasoned that 
religious persons with intolerant attitudes pos- 
sessed a predominantly extrinsic religious orienta- 
tion; that is, their faith served external goals such 
as status in the community, belonging to an in- 
group, and the like. The investigation of this hy- 
pothesis (that extrinsically religious persons would 
be more authoritarian, bigoted, and prejudiced than 
intrinsically religious persons) required appropri- 
ate tools. For this purpose, Allport and colleagues 
developed the Religious Orientation scales. 

Examples of the kinds of items on the 11-item 
Extrinsic scale and the 9-item Intrinsic scale are as 
follows: 


• The church is important as a place to develop
  good social relationships. (Extrinsic)
• Sometimes I find it necessary to compromise my
  religious beliefs for economic reasons. (Extrinsic)
• I try hard to carry my religion over into other aspects
  of my life. (Intrinsic)
• My religion is important because it provides
  meaning to my life. (Intrinsic)


Although originally devised in a yes-no format, 
modern applications of these scales utilize a nine- 
point continuum from (1) strongly disagree to (9) 
strongly agree (Batson, Schoenrade, & Ventis, 
1993). 

Research on the Religious Orientation scales 
has not provided strong support for Allport’s orig- 
inal hypothesis (Wulff, 1996). In fact, several stud- 
ies have shown that persons scoring high on the 
Intrinsic scale actually reveal higher levels of 
authoritarianism, close-mindedness, and prejudice 
toward African Americans, gays, and lesbians. 
Hunsberger (1995) concludes that it is not reli- 
gion per se that makes for prejudice, nor is it 
intrinsic/extrinsic religious orientation. Instead, “it
is the way in which religious beliefs are held that 
seems most directly associated with prejudice, and 
this is best explained by the tendency for funda- 
mentalism and right-wing authoritarianism to be 
closely linked.” Specifically, he links prejudice 
against minorities with authoritarian religious tra- 
ditions that promote an absolute truth, divide the 
world into “Good” and “Evil,” and shun complex- 
ity or doubt in their belief systems. These aspects 
of religious expression are not typically measured 
by paper-and-pencil tests. 

In a recent study, Genia (1993a) concluded that 
the combined Religious Orientation scales (con- 
sisting of 20 items) measure three factors of reli- 
gious expression and not the two proposed by 
Allport (Intrinsic and Extrinsic). She conducted a 
factor analysis of test results for 309 persons ages 
17 to 83 (mean age of 29 years) from diverse reli- 
gious traditions. The nine items on the original In- 
trinsic scale held together very well, substantiating 
a robust factor of intrinsic religious orientation. But 
the eleven items on the original Extrinsic scale 
broke down into two separate subfactors that Genia 
(1993a) labeled use of religion for personal bene- 
fits (Ep) and use of religion for social reward 
(Es). She recommended transforming the original 
instrument into a three-factor test (Intrinsic, 
Extrinsic-personal, Extrinsic-social) by adding and 
dropping a few items. However, most authorities in 
the field have concluded that Allport’s scales served 
a valuable function by identifying key dimensions 
of religious experience and spurring research but 
have now outlived their usefulness. 


Religion as Quest 


Increasingly, the conceptual basis for the dis- 
tinction between intrinsic and extrinsic religious 
orientation has been questioned. Kirkpatrick and 
Hood (1990) summarized the major theoretical 
and methodological criticisms of the scales as 
follows: 


• A lack of conceptual clarity in what the Intrinsic-Extrinsic
scales are supposed to be measuring. Are these types of
motivation (i.e., the motives associated with religious
belief and practice), or personality variables (i.e.,
pervasive aspects of institutional behavior or involvement),
or something else?

• A confusion over the relationship between the
Intrinsic-Extrinsic scales. In particular, are these
opposite ends of a single bipolar dimension, or do the
scales measure separate dimensions (so that conceivably
some persons could score high on both)?


Other problems cited include weaknesses in the fac- 
torial structure, reliability, and construct validity of 
the scales; excessive reliance on a “good religion” 
versus “bad religion” dichotomy; and the folly of 
defining and studying religiousness independent of 
belief content (Kirkpatrick & Hood, 1990). 

In response to the limitations of the Religious 
Orientation scales, Batson and his associates (1993) 
developed a measure of a third religious orientation 
known as Quest. These researchers consider Quest 
to be a more mature and flexible religious outlook 
than the intrinsic and extrinsic orientations. Actu- 
ally, Allport recognized the elements inherent to 
this orientation, but failed to incorporate them in his 
Intrinsic scale. Religion as Quest is characterized 
by complexity, doubt, and tentativeness as ways of 
being religious. This orientation suggests 


an approach that involves honestly facing existen- 
tial questions in all their complexity, while at the 
same time resisting clear-cut, pat answers. An indi- 
vidual who approaches religion in this way recog- 
nizes that he or she does not know, and probably 
never will know, the final truth about such matters. 
Still, the questions are deemed important, and how- 
ever tentative and subject to change, answers are 
sought. There may or may not be a clear belief in a 
transcendent reality, but there is a transcendent, re- 
ligious aspect to the individual’s life. We shall call 
this open-ended, questioning approach religion as 
quest. (Batson et al., 1993, p. 166) 


Examples of the kinds of items on the 12-item 
Quest scale are as follows: 


• My life experiences have led me to reconsider
my religious convictions.

• I find religious doubts upsetting. (reverse scored)

• As I grow and mature, I expect my religious
beliefs to change.

• Questions are more important to my religious
faith than answers.


Items are scored on the same nine-point continuum 
from (1) strongly disagree to (9) strongly agree. Re- 
sults are reported as an average rating. Research 
with 424 undergraduates interested in religion in- 
dicates that Quest is, indeed, a dimension of reli- 
gious experience independent from both Intrinsic 
and Extrinsic orientations. Whereas Intrinsic and 
Extrinsic scores correlated .72, Quest revealed negligible
relationships with both scales (-.05 with Intrinsic
and .16 with Extrinsic).
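
The scoring rule just described (nine-point ratings, some items reverse scored, result reported as an average) can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the example ratings, and the choice of which item is reverse scored are assumptions made for the sketch, not the published scoring key.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 9  # nine-point continuum described in the text


def quest_score(ratings, reverse_items=frozenset()):
    """Return the average rating after reverse scoring the flagged items.

    ratings: dict mapping item number -> rating from 1 to 9.
    reverse_items: item numbers to reverse score (hypothetical here).
    """
    adjusted = []
    for item, rating in ratings.items():
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"item {item}: rating {rating} is outside 1-9")
        if item in reverse_items:
            # Reverse scoring flips the continuum: 1 <-> 9, 2 <-> 8, and so on.
            rating = SCALE_MIN + SCALE_MAX - rating
        adjusted.append(rating)
    return mean(adjusted)


# Illustrative responses to four items; item 2 is treated as reverse scored.
example = {1: 7, 2: 3, 3: 8, 4: 6}
print(quest_score(example, reverse_items={2}))
```

Reporting the mean rather than the sum keeps the score on the same 1 to 9 metric as the individual items, which is why group means such as 6.7 or 5.2 can be read directly against the response scale.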

But exactly what does the Quest scale measure? 
The intention of its authors was that it assess “the 
degree to which an individual’s religion involves an 
open-ended, responsive dialogue with existential 
questions raised by the contradictions and tragedies
of life” (Batson et al., 1993, p. 169). The three
components of the Quest orientation are (1) readi- 
ness to face existential questions without reducing 
their complexity, (2) self-criticism and perception 
of religious doubts as positive, and (3) openness to 
change. But critics have charged that the scale may 
not measure anything religious at all, that instead it 
may assess agnosticism, anti-orthodoxy, religious 
doubt, or religious conflict. 

In response to these criticisms, Batson et al. 
(1993) note the following: 


• Students at Princeton Theological Seminary
scored significantly higher (p < .001) on the
Quest scale (mean of 6.7) than undergraduates at
the same institution (mean of 5.2). This finding
supports the view that the scale is a valid measure
of something religious.

• The 32 members of a charismatic Bible study
group scored significantly higher (p < .001) on
the Quest scale (mean of 5.5) than the 26 members
of a traditional Bible study group (mean of 4.6).
The charismatic group placed emphasis on religion
as a shared search; most prayed with hands
raised, and some members spoke in tongues.


Quest is its own dimension of religious expression, 
and substantial research on the meaning and correlates
of this faith orientation has been completed.
Batson et al. (1993) summarize research with the 
Quest scale by noting that it appears to measure a 
religion of less faith but more works. 

Quest arose as a response to the limitations of 
the Intrinsic and Extrinsic approach to the measure- 
ment of religious orientation. But this brief 12-item 
scale possesses its own limitations, chief among 
them its brevity and factorial simplicity. Several 
other instruments have been proposed to measure 
aspects of religious experience. We survey a few 
prominent and representative approaches in the fol- 
lowing sections. 


The Spiritual Well-Being Scale 


The concept of spiritual well-being can be traced to 
a paper by Moberg (1971) that proposed this form of 
well-being as an essential component of healthy 
aging. Spiritual well-being was conceptualized as a 
two-dimensional construct consisting of a vertical 
dimension and a horizontal dimension. The vertical 
dimension concerned well-being in relation to God 
or a higher power, whereas the horizontal dimension 
involved existential well-being, which is a sense of 
purpose in life without any specific religious refer- 
ence. The challenge of developing a scale to mea- 
sure these components of well-being was taken up 
by Ellison (1983) and Paloutzian and Ellison (1982). 


Their instrument was designated the Spiritual 
Well-Being Scale (SWB Scale). The SWB Scale 
consists of two subscales: Religious Well-Being 
(RWB), which assesses the vertical dimension of 
well-being in relation to God; and Existential Well- 
Being (EWB), which measures the horizontal di- 
mension of well-being in relation to life purpose and 
life satisfaction. Each subscale consists of 10 items 
that are scored from 1 (strongly disagree) to 6 
(strongly agree). The items from the two subscales 
are combined on the SWB Scale, with odd-num- 
bered items assessing religious well-being and even- 
numbered items assessing existential well-being. 
Some items are worded negatively; these are reverse 
scored so that a higher score always indicates 
greater well-being. Items similar to those found on 
the SWB Scale are reproduced in Table 12.9. 

The SWB Scale provides three scores: a total 
SWB score (maximum 120), a subscore for RWB 
(maximum 60), and a subscore for EWB (maximum 
60). Initial reliability and validity studies were based 
upon 206 students from three religiously oriented 
colleges and one secular university. Test-retest relia- 
bility coefficients were .93 for SWB, .96 for RWB, 
and .86 for EWB. Factor analysis tended to support 
the construct validity of the instrument by revealing 
that all of the religious items loaded on a religious
factor, whereas existential items appeared to load on 
two subfactors, one connoting life direction and the
other indicating life satisfaction. The correlation between
the RWB and EWB subscales was modest (r
= .32), indicating that they tap separate aspects of
spiritual well-being.


TABLE 12.9 Items Similar to Those Found on the Spiritual Well-Being Scale

For each statement circle the choice that best indicates the degree of your agreement or
disagreement.

SA = Strongly Agree      D = Disagree
MA = Moderately Agree    MD = Moderately Disagree
A = Agree                SD = Strongly Disagree

I don’t find much reward in private prayer (reverse scored)    SA MA A D MD SD
My relationship with God helps me through hard times           SA MA A D MD SD
Life is inherently without meaning (reverse scored)            SA MA A D MD SD
I feel good about where my life is headed                      SA MA A D MD SD
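
The SWB scoring rules given above (20 items rated 1 to 6, odd-numbered items forming RWB, even-numbered items forming EWB, negatively worded items reverse scored) can be illustrated with a short Python sketch. The function name and the item numbers passed as `reverse_items` are hypothetical; the actual scoring key is not reproduced in the text.

```python
def score_swb(responses, reverse_items=frozenset()):
    """Score 20 ratings (1-6) into RWB, EWB, and total SWB.

    responses: list of 20 ratings; index 0 holds item 1.
    reverse_items: numbers of negatively worded items (hypothetical here).
    """
    if len(responses) != 20:
        raise ValueError("the SWB Scale has 20 items")
    rwb = ewb = 0
    for item, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 6:
            raise ValueError(f"item {item}: rating {rating} is outside 1-6")
        if item in reverse_items:
            rating = 7 - rating  # 1 <-> 6, 2 <-> 5, 3 <-> 4
        if item % 2 == 1:
            rwb += rating  # odd-numbered items: Religious Well-Being
        else:
            ewb += rating  # even-numbered items: Existential Well-Being
    return {"RWB": rwb, "EWB": ewb, "SWB": rwb + ewb}


# A respondent who gives the top rating on every item reaches each maximum.
print(score_swb([6] * 20))
```

With ten items per subscale and a top rating of 6, the maximums of 60, 60, and 120 follow directly from the arithmetic.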

In later writings, Ellison described the SWB 
Scale as a measure of psychospiritual personality in- 
tegration and resultant well-being (Ellison & Smith, 
1991). According to this view, well-being consists 
of “the integral experience of a person who is func- 
tioning as God intended, in consonant relationship 
with Him, with others, and within one’s self” 
(p. 36). This is the biblical notion of shalom, which 
denotes being harmoniously at peace within and 
without. If this conceptualization is correct, healthy 
spirituality as measured by the SWB Scale should 
show positive relationships with independent mea- 
sures of health and subjective well-being. Literally 
dozens of studies have investigated this broad-range 
hypothesis, with generally positive findings. Repre- 
sentative studies are summarized in Table 12.10. 

The one identified shortcoming of the SWB
Scale is an apparent low ceiling, especially in religious
samples. Ledbetter, Smith, Vosler-Hunter,
and Fischer (1991) caution that the clinical usefulness
of the scale is limited to low scores (since
high-functioning religious persons tend to “top
out” on the scale). They also offer suggestions for
revision (e.g., rewording items in more extreme directions)
toward the goal of increasing the ceiling
level of the SWB Scale. Bufford, Paloutzian, and
Ellison (1991) have published norms for the test,
but caution that in many religious samples the
typical individual receives the maximum score.
This would indicate that the scale is helpful in research,
but is not useful for distinguishing among
individuals with high levels of spiritual well-being.


TABLE 12.10 Summary of Findings with the Spiritual Well-Being Scale

Spiritual Well-Being Scale Scores Correlate Positively With:

Being closer to ideal body weight (Hawkins & Larson, 1984)
Perceived health in rural elderly (DeCrans, 1990)
Overall adjustment to hemodialysis (Campbell, 1988)
Hope in cancer patients (Mickley, 1990)
Measures of self-esteem (Paloutzian & Ellison, 1982)

Spiritual Well-Being Scale Scores Correlate Negatively With:

Diastolic and systolic blood pressure (Hawkins, 1988)
Frequency and amount of pain in cancer patients (Granstrom, 1987)
Social isolation and despair (Bonner, 1988)
Aggressiveness and conflict avoidance (Bufford & Parker, 1985)
Depression scores on the MMPI (Fehring, Brennan, & Keller, 1987)
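
One simple way to look for the low ceiling discussed above is to compute the share of respondents who obtain the maximum possible score. The sketch below is illustrative only: the function name and the sample totals are invented for the example and do not come from the published norms.

```python
def ceiling_share(total_scores, max_score):
    """Return the fraction of respondents scoring at the scale maximum."""
    if not total_scores:
        raise ValueError("at least one score is required")
    at_max = sum(1 for score in total_scores if score >= max_score)
    return at_max / len(total_scores)


# Hypothetical total SWB scores from a small religious sample (maximum 120).
sample = [120, 120, 118, 120, 97, 120]
print(f"{ceiling_share(sample, max_score=120):.0%} of respondents top out")
```

When a large share of a sample sits at the maximum, differences among those respondents are invisible to the scale, which is the practical meaning of the caution about high-functioning religious persons.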


The Faith Maturity Scale 


In 1987, six major Protestant denominations un- 
dertook a national four-year study of personal faith, 
denominational allegiance, and their determinants 
(Benson, Donahue, & Erickson, 1993). Funded in 
part by the Lilly Endowment, this project spawned 
what is undoubtedly the most sophisticated mea- 
sure of spiritual maturity ever conceived. The Faith 
Maturity Scale (FMS) arose as a practical tool to 
serve three research purposes: 


1. Provide baseline data on the vitality of faith in 
mainstream Protestant congregations 

2. Identify the contributions of demographic, per- 
sonal, and congregational variables to faith 
development 

3. Furnish a criterion variable for evaluating the 
impact of religious education in mainstream 
denominations 


The development of the scale was a time- 
consuming and careful process that began with a 
working definition: 


Faith maturity is the degree to which a person em- 
bodies the priorities, commitments, and perspec- 
tives characteristic of vibrant and life-transforming 
faith, as they have been understood in “mainline” 
Protestant traditions. (Benson, Donahue, & Erick- 
son, 1993, p. 3) 


Using open-ended questionnaires with a conve- 
nience sample of 410 mainline Protestant adults, the 
test developers next identified eight core dimensions
of faith maturity. Three advisory panels provided
ongoing counsel during this stage and the next
phase of item writing. These interactions assured 
that the scale possessed face and content validity. 
The resulting FMS is a 38-item test that 
embodies key indicators of faith maturity in eight 
core areas (Table 12.11). Items are answered on a 
seven-point scale from 1 = never true to 7 = always 
true. Based upon the areas assessed, the reader will 
notice that right belief is only one aspect of a ma- 
ture faith. In large measure, faith maturity is de- 
fined by value and behavioral consequences. As the 
authors note, the Faith Maturity Scale “parts com- 
pany with more traditional ways of defining and 
measuring personal religion.” Yet it does embody 
the kinds of behaviors and attitudes that derive
from a dynamic, life-transforming faith. These behaviors
and attitudes are consistent with the theology
found in most religious traditions, but are
especially pertinent for the particular purpose of assessing
faith maturity in the Protestant context.


TABLE 12.11 The Eight Core Dimensions and Sample Items from the Faith Maturity Scale

A. Trusts and believes (5 items)
   Every day I see evidence that God is at work in the world
B. Experiences the fruits of faith (5 items)
   I feel weighed down by all my responsibilities (reverse scored)
C. Integrates faith and life (5 items)
   My faith influences how I think and act every day
D. Seeks spiritual growth (4 items)
   I take time to meditate or pray
E. Experiences and nurtures faith in community (4 items)
   I talk with others about my faith
F. Holds life-affirming values (6 items)
   I tend to be critical of other persons (reverse scored)
G. Advocates social change (4 items)
   I believe the churches of this nation should get involved in political issues
H. Acts and serves (5 items)
   I offer significant amounts of time to help others

Note: The sample items are similar to those on the Faith Maturity Scale.

Source: Based on Benson, P., Donahue, M., & Erickson, J. (1993).
The Faith Maturity Scale: Conceptualization, measurement, and
empirical validation. In M. L. Lynn & D. O. Moberg (Eds.), Research
in the social scientific study of religion (vol. 5). Greenwich,
CT: JAI Press.

The FMS is scored as the mean of the 38 items, 
which yields a potential range of 1 to 7. The aver- 
age score for 3,040 adults in five Protestant de- 
nominations was 4.63, which indicates that the 
instrument avoids the “ceiling effect” found on 
other scales such as the Spiritual Well-Being Scale, 
discussed previously. The estimated reliability of 
the scale is very robust across age, gender, occu- 
pation, and denomination, with typical coefficient 
alphas of .88 (Benson et al., 1993). Test-retest reli- 
ability was not reported. 
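
For readers who want to see how a coefficient alpha of the kind reported above is obtained, the following Python sketch applies the standard formula, alpha = (k / (k - 1)) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the small data set are assumptions for illustration; they are not FMS data.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Compute coefficient alpha for a respondents-by-items table.

    item_scores: list of respondents, each a list of k item ratings.
    """
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the total score across respondents.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings on four seven-point items from five respondents.
ratings = [
    [5, 6, 5, 4],
    [3, 3, 4, 2],
    [6, 7, 6, 6],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
]
print(round(coefficient_alpha(ratings), 2))
```

Alpha rises toward 1.0 as the items covary more strongly, so a value near .88 on a 38-item scale indicates that respondents tend to answer the items in a consistent direction.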

The validity of the scale is supported by several 
lines of evidence, beginning with the careful ap- 
proach to item selection, by which face validity and 
content validity were built-in. Construct validity 
was demonstrated in several ways. First, it was pre- 
dicted and confirmed that groups presumed to dif- 
fer in levels of faith maturity would obtain 
significantly different mean scores on the FMS. In- 
deed, pastors scored the highest (5.3), followed by 
church education coordinators (4.9), teachers (4.7), 
adults (4.6), and youth (4.1)—each group in re- 
spective order scoring significantly lower than the 
others. Second, pastors’ ratings of the faith maturity 
of 123 congregation members on a 1 to 10 scale 
correlated very substantially (r = .61) with the FMS 
scores of these persons, indicating a correspon- 
dence between independent expert ratings and self- 
report. The scale also revealed predictive utility. 
Specifically, FMS scale scores were strongly re- 
lated to a variety of prosocial behaviors such as do- 
nating time to help those who are poor, hungry, or 
sick; promoting a greater role for women in the 
church; and endorsing the use of foreign policy to 
challenge apartheid. 

One cautionary note is that the susceptibility of 
the scale to response sets (e.g., “fake good”) is sim- 
ply unknown. The authors of the test call for further 
research to examine yea-saying, social desirability, 
and other response sets. They also offer a refreshing
modesty in discussing their 38-item test:


It would be presumptuous to claim that these 38 are
the final word on what defines mature faith. Over 
time, we hope that the Faith Maturity Scale itself ma- 
tures, being further informed and modified through 
interaction with an expanded range of researchers 
and reactors representing a widening circle of reli- 
gious traditions. (Benson et al., 1993, p. 24) 


Overall, the FMS holds great promise as a measure 
of faith maturity. A special virtue of the test is that it 
avoids the ceiling effect so common in other mea- 
sures of spiritual variables. This feature guarantees 
that the instrument will prove useful in examining 
the effects of educational programs as well as 
changes over time in religiously devout individuals. 


The Spiritual Experience Index 


Measures of spiritual and religious functioning 
largely have been developed from a Christian in- 
terpretation of faith, with the result that applica- 
tions cannot be extended to persons whose spiritual 
beliefs are rooted in any other ideology. A notable 
exception is the Spiritual Experience Index (SEI), 
which derives from a developmental view of faith. 
This test is based upon the theory that faith is a de- 
velopmental phenomenon that progresses from the 
highly egocentric religiosity of childhood to the 
transcendent faith of middle adulthood—when 
nurtured by favorable psychosocial conditions. The 
developmental theory was first elucidated by 
Fowler (1981) and later revised by Genia (1990). A 
summary of Genia’s five stages of faith is included 
in Table 12.12. 

This highest level of religious maturity (found 
in all the major faiths) is characterized by ten cri- 
teria (Genia, 1993b): 


1. Transcendent relationship to something greater
than oneself
2. Consistency of lifestyle and behavior with spiritual
values
3. Commitment without absolute certainty
4. Openness to spiritually diverse viewpoints
5. Lack of magical thinking and anthropomorphic
God concepts
6. Inclusion of both rational and emotional
components
7. Social interest and humanitarian concern
8. Mature faith is life-enhancing and growth-producing
9. Provision of meaning and purpose in life
10. Lack of dependence upon particular practices
or formal religious structure


TABLE 12.12 The Stages of Religious Faith

Stage 1: Egocentric faith

Characteristic of persons with immature personality development,
egocentric faith is narcissistic, based on anticipated
reward or punishment. The divine image is
anthropomorphic, and prayer is petitionary.

Stage 2: Dogmatic faith

Religious dogma is used rigidly and defensively as a
means of psychological support. Scripture is interpreted
literally and absolutely. Prayer may take the
form of bargaining with God.

Stage 3: Transitional faith

Characteristic of many adolescents, the individual recognizes
a freedom to engage in questioning and doubt.
Religious experimentation such as “trying on” different
ideologies or switching denominations may occur.

Stage 4: Reconstructed internalized faith

There is commitment to a self-chosen faith that transcends
egocentric concerns. The chosen ideology provides
a sense of purpose and meaning in life. Religious
doctrine is more complex and prayers feature thanksgiving,
praise, and devotion.

Stage 5: Transcendent faith

The individual maintains a transcendent relationship to
something greater than the self. There is commitment
without absolute certainty, and the person’s style of living
is consistent with his/her religious values.

Source: Based on Genia, V. (1990). Religious development: A
synthesis and reformulation. Journal of Religion and Health, 29,
85-99.


The purpose of the SEI scale is to assess the de- 
gree of spiritual maturity (conceived as a unidi- 
mensional construct) for persons from diverse 
religious and spiritual traditions. The items of the 
scale were developed from these 10 criteria by 
means of rational scale construction (meaning that 
the test author devised several items pertinent to 
each of the criteria). A try-out sample of 75 persons
(40 percent Roman Catholic, 28 percent Protestant,
23 percent Jewish, and 9 percent unaffiliated) completed
a preliminary scale of 50 items. Twelve
items were dropped due to low item-total correlations,
leaving a scale of 38 items.
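
The item-screening step described above can be sketched as follows. The source says only that items with low item-total correlations were dropped; the use of the corrected item-total correlation (each item against the sum of the remaining items), the cutoff of .30, and all names and data below are assumptions made for illustration.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if either list is constant


def corrected_item_total(item_scores):
    """Correlate each item with the total of all the other items."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    correlations = []
    for i in range(k):
        item = [row[i] for row in item_scores]
        rest = [total - value for total, value in zip(totals, item)]
        correlations.append(pearson_r(item, rest))
    return correlations


def items_to_keep(item_scores, cutoff=0.30):
    """Return the 1-based numbers of items at or above the assumed cutoff."""
    correlations = corrected_item_total(item_scores)
    return [i for i, r in enumerate(correlations, start=1) if r >= cutoff]


# Hypothetical six-point ratings: item 4 runs against the other three items.
tryout = [
    [1, 1, 1, 5],
    [2, 2, 2, 4],
    [3, 3, 3, 3],
    [4, 4, 4, 2],
    [5, 5, 5, 1],
]
print(items_to_keep(tryout))
```

Removing the item from its own total avoids inflating the correlation, which matters most on short scales; on a 50-item tryout form the corrected and uncorrected values would usually be close.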

Examples of items similar to those on the SEI 
include the following: 


My faith provides meaning and purpose to my 
life. 

Usually a moral dilemma has only one right 
solution. (reverse scored) 

I sense a strong spiritual connection with all of 
humankind. 

My faith helps me to deal with tragedy and 
suffering. 


The item format is a six-point Likert-type scale 
from 1 = strongly disagree to 6 = strongly agree. 
The score is reported as the total raw score, with a 
range of 38 to 228. For the initial sample, scores 
formed a normal distribution ranging from 103 to 
211 with a mean of 166. 

Internal consistency of the SEI is reported to be
.87 (coefficient alpha), whereas test-retest stability
was not tested. When the 38 items were entered into
a Principal Axis Factoring analysis, the scale held
up as a unidimensional construct; no factors
emerged that explained a meaningful portion of the
variance. Thus, the findings indicate good internal
reliability for the SEI and support its use as a unidimensional
measure.

Validity inspection consisted of correlational 
analyses with a variety of other tests and measures. 
Strong correlations were noted with several variables:
dogmatism (r = -.52), which indicates that
high scale scores go with low dogmatism; Quest
scores (r = .44), which suggests that the scale includes
aspects of complexity and doubt; intrinsic
religiosity (r = .43), which indicates that strong
spiritual experience promotes a mature religious
faith; intolerance of ambiguity (r = -.40), which
argues that strong spiritual experience allows for
tolerance of ambiguity; and frequency of worship
(r = .24), which is intriguing because none of the
SEI test items pertain to this behavioral index of
faith.

Overall, these findings support the reliability 
and validity of this scale, but additional research is 
needed to examine its psychometric properties. 
Further analysis is especially desirable insofar as 
the SEI occupies a special niche in the assessment 
of spiritual variables—it is one of the few scales 
that is applicable across diverse religious traditions. 


SUMMARY 


1. An attitude is a learned cognitive, affective, 
or behavioral predisposition to respond positively 
or negatively to certain objects, situations, institu- 
tions, concepts, or persons. Attitudes have an eval- 
uative component and serve motivational functions 
by helping individuals organize their perceptions 
and make sense out of the world. 


2. Three broad approaches to the assessment 
of attitudes are behavioral, covert, and question- 
naire. In the behavioral approach, the respondent’s 
willingness to take an action (e.g., donate money to 
a cause) is used to gauge a positive attitude. Covert 
approaches to attitude assessment involve unobtru- 
sive procedures such as the lost-letter technique (re- 
turn rates used to gauge public attitudes) and 


pupillometrics (pupil enlargement used to gauge in- 
dividual attitudes). 

3. An implicit association test is another ex- 
ample of a covert measure of attitudes. In an im- 
plicit association test, the researcher uses relative 
reaction times to measure the automatic or “un- 
conscious” associations of individuals to different 
target concepts presented on a computer screen. 


4. The paper-and-pencil questionnaire em- 
ploying a Likert scale response format (5- or 7- 
point continuum from “strongly agree” to “strongly 
disagree”) is the mainstay approach to attitude 
measurement. A good questionnaire will possess 
strong internal consistency as measured by coeffi- 
cient alpha or similar index. 


5. The link between attitude measures and behavior
generally has proven to be weak. The link is
stronger when attitudes are strongly activated and 
when the actor is highly conscious of his or her 
attitudes. 


6. With Kohlberg’s Moral Judgment Scale, 
the examinee is asked a series of structured ques- 
tions pertaining to several moral dilemmas. Re- 
sponses are categorized according to six stages and 
three levels of development: preconventional, con- 
ventional, and postconventional. 


7. Rest’s Defining Issues Test is a spinoff of 
the Moral Judgment Scale that uses a completely 
objective scoring format. The test yields several 
quantitative indices, including the P score (per- 
centage of principled thinking) and the N2 index 
(based on complex formulas), both of which show 
good reliability and validity. 


8. A relatively neglected area in assessment 
is the evaluation of spiritual and religious dimen- 
sions. This is unfortunate because assessment of 
these variables could have important implications 
for individual clients. 


9. One of the first forms of religious assess- 
ment was the Allport-Ross Religious Orientation 
scales. These scales popularized the concepts of in- 
trinsic religiousness (persons live their religion to 
find meaning and direction) versus extrinsic reli- 


giousness (persons use their religion to seek secu- 
rity or status). 


10. The Quest scale seeks to measure a third 
religious orientation (beyond intrinsic and extrin- 
sic) characterized by complexity, doubt, and tenta- 
tiveness as ways of being religious. This simple 
12-item scale appears to measure a religion of less 
faith but more works. 


11. The Spiritual Well-Being Scale is a main- 
stay in the field of religious assessment. It consists 
of two subscales, Religious Well-Being (a vertical 
dimension of well-being in relation to God) and 
Existential Well-Being (a horizontal dimension of 
well-being in relation to life purpose and satisfac- 
tion). A problem with this scale is its low ceiling. 


12. The Faith Maturity Scale is an ambitious
scale devised at the behest of six major Protestant 
denominations. The 38 items are answered on a 
seven-point scale, providing an index of a vibrant 
and life-transforming faith, as understood in main- 
line Protestant faiths. 


13. The Spiritual Experience Index was de- 
signed to measure spiritual maturity from a devel- 
opmental view of faith, independent of any 
particular creed or religion. The 38 items are an- 
swered on a six-point scale, yielding a unitary 
index of spiritual maturity, broadly defined. 


Key Terms and Concepts


In psychological testing, a fundamental distinction
often is drawn between ability tests and personality
tests. Defined in the broadest sense, ability
tests include the plethora of instruments for mea- 
suring intelligence, achievement, aptitude, and neu- 
ropsychological functions. In the preceding 12 
chapters we have explored the nature, construction, 
application, reliability, and validity of these instru- 
ments. In the next two chapters we shift the em- 
phasis to personality tests. Personality tests seek to 
measure one or more of the following: personality 
traits, dynamic motivation, personal adjustment, 
psychiatric symptomatology, social skills, and attitudinal
characteristics. This chapter investigates the
origins of personality testing. In Topic 13A, Theo- 
ries and the Measurement of Personality, the 
different ways in which researchers have concep- 
tualized personality are surveyed to illustrate how 
their theories have impacted the design of personal- 
ity tests and assessments. In Topic 13B, Projective 
Techniques, we examine the multiplicity of instru- 
ments based upon the turn-of-the-twentieth-century 
psychoanalytic hypothesis that responses to am- 
biguous stimuli reveal the innermost, unconscious 
mental processes of the examinee. The coverage of
personality assessment continues in the next chap- 
ter with a review of objective tests and procedures, 
including self-report inventories and behavioral as- 
sessment approaches. 


PERSONALITY: AN OVERVIEW


Although personality is difficult to define, we can 
distinguish two fundamental features of this vague 
construct. First, each person is consistent to some 
extent; we have coherent traits and action patterns 
that arise repeatedly. Second, each person is dis- 
tinctive to some extent; behavioral differences exist 
between individuals. Consider the reactions of 
three graduate students when their midterm exam- 
inations were handed back. Although all three stu- 
dents received nearly identical grades (solid Bs), 
personal reactions were quite diverse. The first stu- 
dent walked off sullenly and was later overheard to 
say that a complaint to the departmental adminis- 
trator was in order. The second student was pleased, 
stating out loud that a B was, after all, a respectable 
grade. The third student was disappointed but sto- 
ical. He blamed himself for not studying harder. 

How are we to understand the different reac- 
tions of these three persons, each of whom was re- 
sponding to an identical stimulus? Psychologists 
and laypersons alike invoke the concept of per- 
sonality to make sense out of the behavior and 
expressed feelings of others. The notion of person- 
ality is used to explain behavioral differences be- 
tween persons (for example, why one complains 
and another is stoical) and to understand the be- 
havioral consistency within each individual (for 
example, why the complaining student noted pre- 
viously was generally sour and dissatisfied). 

In addition to understanding personality, psy- 
chologists also seek to measure it. Literally hun- 
dreds of personality tests are available for this 
purpose; we will review historically prominent in- 
struments and also discuss some promising new ap- 
proaches. However, in order that the reader can 
better comprehend the diversity of instruments and 
approaches, we begin with a more fundamental 
question: How is personality best conceptualized? 


As the reader will discover, in order to measure per- 
sonality we must first envision what it is we seek to 
measure. The reader will better appreciate the mul- 
tiplicity of tests and procedures if we also briefly 
describe the personality theories which comprise 
the underpinnings for these instruments. We close 
out this topic by raising a general question perti- 
nent to all theories and testing approaches: How 
stable and predictable is behavior? 

Although we partition personality tests sepa- 
rately from the ability tests, the distinction between 
these two kinds of instruments is far from absolute. 
Intellectual ability is, in part, a characterological 
feature based on such attributes as perseverance 
and self-control. Thus, ability tests inevitably tap 
important dimensions of personality, albeit in an in- 
direct and imperfect manner. Often, the converse is 
also true: Personality tests may be saturated with 
ability factors. For example, certain personality di- 
mensions such as openness to experience probably 
correlate positively with intelligence. As the reader 
will discover in the next chapter, some true-false 
personality inventories incorporate a very robust in- 
telligence factor (e.g., Cattell, Eber, & Tatsuoka, 
1970). 


PSYCHOANALYTIC THEORIES 
OF PERSONALITY 


Psychoanalysis was the original creation of Sig- 
mund Freud (1856-1939). While it is true that 
many others have revised and adapted his theo- 
ries, the changes have been slight in comparison to 
the substantial foundations that can be traced to this 
singular genius of the Victorian and early-twenti- 
eth-century era. Freud was enormously prolific in 
his writing and theorizing. We restrict our discus- 
sion to just those aspects of psychoanalysis that 
have influenced psychological testing. In particu- 
lar, the Rorschach, the Thematic Apperception Test, 
and most of the projective techniques critiqued in 
the next topic dictate a psychoanalytic framework 
for interpretation. Readers who wish a more thor- 
ough review of Freud’s contributions can start with 
the New Introductory Lectures on Psychoanalysis 
(Freud, 1933). Reviews and interpretations of 
Freud’s theories can be found in Stafford-Clark
(1971) and Fisher and Greenberg (1984). 


Origins of Psychoanalytic Theory 


Freud began his professional career as a neurolo- 
gist, but was soon specializing in the treatment of 
hysteria, an emotional disorder characterized by 
histrionic behavior and physical symptoms of psy- 
chic origin such as paralysis, blindness, and loss of 
sensation. With his colleague Joseph Breuer, Freud 
postulated that the root cause of hysteria was buried 
memories of traumatic experiences such as child- 
hood sexual molestation. If these memories could 
be brought forth under hypnosis, a release of emo- 
tion called abreaction would take place and the hys- 
terical symptoms would disappear, at least briefly 
(Studies on Hysteria, Breuer & Freud, 1893-1895). 
From these early studies Freud developed a 
general theory of psychological functioning with 
the concept of the unconscious as its foundation. He 
believed that the unconscious was the reservoir of 
instinctual drives and a storehouse of thoughts and 
wishes that would be unacceptable to our conscious 
self. Thus, Freud argued that our most significant 
personal motivations are largely beyond conscious 
awareness. The concept of the unconscious was 
discussed in elaborate detail in his first book (The 
Interpretation of Dreams, Freud, 1900). Freud be- 
lieved that dreams portray our unconscious motives 
in a disguised form. Even a seemingly innocuous 
dream might actually have a hidden sexual or ag- 
gressive meaning, if it is interpreted correctly. 
Freud’s concept of the unconscious penetrated 
the very underpinnings of psychological testing 
early in the twentieth century. An entire family of 
projective techniques emerged, including inkblot 
tests, word association approaches, sentence com- 
pletion techniques, and story-telling (apperception) 
techniques (Frank, 1939, 1948). Each of these meth- 
ods was predicated on the assumption that uncon- 
scious motives could be divined from an examinee’s 
responses to ambiguous and unstructured stimuli. In 
fact, Rorschach (1921) likened his inkblot test to an 
X ray of the unconscious mind. Although he 
patently overstated the power of projective techniques,
it is evident from Rorschach’s view that the
psychoanalytic conception of the unconscious had 
a strong influence on testing practices. 


The Structure of the Mind 


Freud’s views on the structure of the mind and the 
operation of defense mechanisms also influenced 
psychological testing and assessment (New Intro- 
ductory Lectures on Psychoanalysis, Freud, 1933). 
Several tests and assessment approaches discussed 
in this chapter are predicated upon the psychoana- 
lytic conception of defense mechanisms, so this 
topic deserves brief summary. 

Freud divided the mind into three structures: 
the id, the ego, and the superego. The id is the ob- 
scure and inaccessible part of our personality that 
Freud likened to “a chaos, a cauldron of seething 
excitement.” Because the id is entirely uncon- 
scious, we must infer its characteristics indirectly 
by analyzing dreams and symptoms such as anxi- 
ety. From such an analysis, Freud concluded that 
the id is the seat of all instinctual needs such as for 
food, water, sexual gratification, and avoidance of 
pain. The id has only one purpose, to obtain imme- 
diate satisfaction for these needs in accordance 
with the pleasure principle. The pleasure principle 
is the impulsion toward immediate satisfaction 
without regard for values, good or evil, or morality. 
The id is also incapable of logic and possesses no 
concept of time. The chaotic mental processes of 
the id are therefore unaltered by the passage of 
time, and impressions that have been pushed down 
into the id “are virtually immortal and are preserved
for whole decades as though they had only recently 
occurred” (Freud, 1933). 

If our personality consisted only of an id striv- 
ing to gratify its instincts without regard for reality, 
we would soon be annihilated by outside forces. 
Fortunately, soon after birth part of the id develops 
into the ego or conscious self. The purpose of the 
ego is to mediate between the id and reality. The 
ego is part of the id and servant to it, but the ego 
“interpolates between desire and action the pro- 
crastinating factor of thought” (Freud, 1933). Thus, 
the ego is largely conscious and obeys the reality 
principle; it seeks realistic and safe ways of discharging
the instinctual tensions which are constantly
pushing forth from the id.

The ego must also contend with the superego, 
the ethical component of personality that starts to 
emerge in the first five years of life. The superego 
is roughly synonymous with conscience and com- 
prises the societal standards of right and wrong that 
are conveyed to us by our parents. The superego is 
partly conscious, but a large part of it is unconscious;
that is, we are not always aware of its existence
or operation. The function of the superego is
to restrict the attempts of the id and ego to obtain 
gratification. Its main weapon is guilt, which it uses 
to punish the wrongdoings of the ego and id. Thus, 
it is not enough for the ego to find a safe and real- 
istic way for the gratification of id strivings. The 
ego must also choose a morally acceptable outlet, 
or it will suffer punishment from its overseer, the 
superego. This explains why we may feel guilty for 
immoral behavior such as theft even when getting 
caught is impossible. Another part of the superego 
is the ego ideal, which consists of our aims and as- 
pirations. The ego measures itself against the ego 
ideal and strives to fulfill its demands for perfec- 
tion. If the ego falls too far short of meeting the 
standards of the ego ideal, a feeling of guilt may re- 
sult. We commonly interpret this feeling as a sense 
of inferiority (Freud, 1933). 


The Role of Defense Mechanisms 


The ego certainly has a difficult task, acting as me- 
diator and servant to three tyrants: id, superego, and 
external reality. It may seem to the reader that the 
task would be essentially impossible and that the in- 
dividual would therefore be in a constant state of 
anxiety. Fortunately, the ego has a set of tools at its 
disposal to help carry out its work, namely, mental 
strategies collectively labeled defense mechanisms. 

Defense mechanisms come in many varieties, 
but they all share three characteristics.
First, their exclusive purpose is to help the ego re- 
duce anxiety created by the conflicting demands of 
id, superego, and external reality. In fact, Freud felt 
that anxiety was a signal telling the ego to invoke
one or more defense mechanisms in its own behalf.
Defense mechanisms and anxiety are therefore 
complementary concepts in psychoanalytic theory, 
one existing as a counterforce to the other. The sec- 
ond common feature of defense mechanisms is that 
they operate unconsciously. Thus, even though de- 
fense mechanisms are controlled by the ego, we are 
not aware of their operation. The third characteris- 
tic of defense mechanisms is that they distort inner 
or outer reality. This property is what makes them 
capable of reducing anxiety. By allowing the ego to 
view a challenge from the id, superego, or external 
reality in a less-threatening manner, defense mech- 
anisms help the ego avoid crippling levels of anxi- 
ety. Of course, because they distort reality, the 
rigid, excessive application of defense mechanisms 
may create more problems than it solves. 


Assessment of Defense Mechanisms 
and Ego Functions 


Although Freud introduced the concept of defense 
mechanisms, it was left to his followers to elucidate 
these unconscious mental strategies in more detail 
(Paulhus, Fridhandler, & Hayes, 1997). An early 
portrayal of defense mechanisms was provided by 
Freud’s daughter, Anna (The Ego and the Mecha- 
nisms of Defense, A. Freud, 1946). However, the ap- 
plication of these concepts to psychological 
measurement and assessment is much more recent. 
For example, Loevinger (1976, 1979, 1984) has 
produced a sentence completion technique for mea- 
suring ego development that is based, indirectly, on 
the analysis of defense mechanisms. This interest- 
ing approach to personality measurement is out- 
lined briefly in the next unit. Here we will present 
Vaillant’s (1977, 1992) work to illustrate the mea- 
surement of defense mechanisms and the applica- 
tion of this information to the understanding of 
personality. 

Vaillant (1971) developed a hierarchy of ego 
adaptive mechanisms based on the assumption 
that some defensive mechanisms are intrinsically 
healthier than others. In his view, defense mecha- 
nisms can be grouped into four different types. 
Listed in order of increasing healthiness, the types 
are psychotic, immature, neurotic, and mature
(Table 13.1). Psychotic mechanisms such as gross 
denial of external reality are the least healthy be- 
cause they distort reality to an extreme degree. 
They appear “crazy” to the beholder. Immature 
mechanisms such as the projection of one’s own 
unacknowledged feelings to others are healthier 
than psychotic mechanisms. Nonetheless, they are 
easily detected by outside observers and seen as un- 
desirable. Neurotic defense mechanisms typically 
alter private feelings so that they are less threaten- 
ing. An example is intellectualization, a defense 
mechanism in which threatening matters are ana- 
lyzed in bland terms that are void of feelings. For 
example, a physician whose mother died recently 
might talk at great length about the medical char- 
acteristics of her cancer, thereby easing his sense 
of loss. Mature mechanisms of defense appear to 
the beholder as convenient virtues. An example is 
certain forms of humor which do not distort reality 
but which can ease the burden of matters “too terrible
to be borne” (Vaillant, 1977).

The application of defense mechanisms to the 
understanding of personality is illustrated in the 
Grant Study, a 45-year follow-up study conducted 
by Vaillant and others (Vaillant, 1977; Vaillant & 
Vaillant, 1990). These researchers used structured 
interviews to obtain evidence of unconscious adap- 
tive mechanisms from a sample of 95 men. The 
subjects were from an original sample of 268 stu- 
dents from Harvard University’s classes of 1939 
through 1944. At follow-up, Vaillant interviewed 
each participant for two hours, using a semistruc- 
tured interview schedule (Vaillant, 1977, App. B). 
In addition, the subjects filled out autobiographical 
questionnaires and provided other sources of infor- 
mation. The entire protocol for each subject was 
then evaluated by Vaillant and other raters accord- 
ing to the extent that each defense mechanism 
characterized the individual’s adaptation to life. 
Defense mechanisms were scored from 1 (absent) 
to 5 (major). Here is an example of one uncon- 
scious adaptive behavior: 

A California hematologist developed a hobby of
cultivating living cells in test tubes. In a recent interview,
he described with special interest and animation
an unusually interesting culture that he had
grown from a tissue biopsy from his mother. Only 
toward the end of the interview did he casually re- 
veal that his mother had died from a stroke only 
three weeks previously. His mention of her death 
was as bland as his description of her still-living 
tissue culture had been affectively colored. Inge- 
niously and unconsciously, he had used his hobby 
and his special skills as a physician to mitigate tem- 
porarily the pain of his loss. Although his mother 
was no longer alive, by shifting his attention he was 
still able to care for her. There was nothing morbid
in the way he told the story; and because ego mech- 
anisms are unconscious, he had no idea of his de- 
fensive behavior. Many of the healthiest men in the 
Study used similar kinds of attention shifts or dis- 
placement. Unless specifically looked for by a 
trained observer, such behavior goes unnoticed 
more often than not. (Vaillant, 1977) 


Most likely, this individual would receive a rating 
of 5 (major) for the neurotic defense mechanism of 
displacement. 

Considering the degree of skilled judgment re- 
quired by the evaluation task, the interrater relia- 
bility of the defense mechanism ratings was—with 
a few exceptions—respectable. The individual de- 
fense mechanisms possessed reliabilities that 
ranged from .53 (Fantasy) to .96 (Projection); most 
reliabilities were in the .70s and .80s. Reliability of 
a global rating (reflecting the ratio of mature to im- 
mature ratings) was .77. 

The validity of defense mechanism ratings 
hinges mainly on the demonstration that devel- 
opmental changes and group differences are con- 
sistent with psychoanalytic theory regarding these 
constructs. We would expect, for example, that the 
Grant Study subjects would use fewer immature 
and more mature defense mechanisms as they 
grew into middle age, and this is precisely what 
Vaillant discovered. In addition, we would expect 
that persons found to be maladjusted by other 
criteria (e.g., frequent divorce, underachievement) 
would rate less favorably on defense mechanisms 
in comparison to adjusted persons, and this is 
also what Vaillant observed. In sum, the analysis of 
defense mechanisms is a promising approach to 
personality assessment. However, this approach 
does have two drawbacks: The examiner needs 
specialized training to recognize defense mechanisms,
and the process of collecting relevant information
from examinees is very time-consuming.


TABLE 13.1 Levels of Defense Mechanisms Proposed by Vaillant (1977)


I. Psychotic

Delusional Projection: frank delusions about external reality, usually of a
persecutory nature

Denial: denial of external reality; e.g., failing to acknowledge that one has a
terminal illness

Distortion: grossly reshaping external reality to suit inner needs; e.g.,
wish-fulfilling delusions


II. Immature

Projection: attributing one’s own unacknowledged feelings to others; e.g., “You’re
angry, not me!”

Schizoid Fantasy: use of fantasy and inner retreat for the purpose of conflict
resolution and gratification

Hypochondriasis: transforming reproach toward others first into self-reproach then
into complaints of physical illness

Passive-Aggressive Behavior: aggression toward others expressed indirectly and
ineffectively through passivity or directed against the self

Acting Out: direct expression of an unconscious wish or impulse in order to avoid
being conscious of the feeling that accompanies it


III. “Neurotic”

Intellectualization: thinking about wishes in formal, unfeeling terms, but not acting
upon them

Repression: seemingly inexplicable memory lapses or failure to acknowledge
information; e.g., “forgetting” a dental appointment

Displacement: directing of feelings toward something or someone other than the
real object; e.g., kicking the dog when angry with the boss

Reaction Formation: unconsciously turning an impulse into its opposite; e.g.,
oversolicitousness to a hated coworker

Dissociation: temporary but drastic modification of one’s character to avoid
emotional distress; e.g., a brief devil-may-care attitude


IV. Mature

Altruism: vicarious but constructive and gratifying service to others; e.g.,
philanthropy

Humor: playful acknowledgment of ideas and feelings without discomfort and
without unpleasant effects on others; does not include sarcasm

Suppression: conscious or semiconscious decision to postpone paying attention to a
conscious conflict or impulse

Anticipation: realistic anticipation of or planning for future inner discomfort; e.g.,
realistic anticipation of surgery or separation

Sublimation: indirect expression of instinctual wishes without adverse
consequences or loss of pleasure; e.g., channeling aggression into sports


Source: Based on Vaillant, G. (1977). Adaptation to life: How the best and the brightest came of age.
Boston: Little, Brown.


TYPE THEORIES OF PERSONALITY


The earliest personality theories attempted to sort 
individuals into discrete categories or types. For ex- 
ample, the Greek physician Hippocrates (ca. 460- 
377 B.C.) proposed a humoral theory with four per- 
sonality types (sanguine, choleric, melancholic, and 
phlegmatic) that was too simplistic to be useful. In 
the 1940s, Sheldon and Stevens (1942) proposed a 
type theory based upon the relationship between 
body build and temperament. Their approach stim- 
ulated a flurry of research and then faded into ob- 
scurity. Nonetheless, typological theories have 
continued to capture intermittent interest among 
personality researchers. We will illustrate type theories
by reviewing contemporary research on coronary-prone
personality types.


Type A Coronary-Prone Behavior Pattern 


Friedman and Rosenman (1974) investigated the 
psychological variables that put individuals at 
higher risk of coronary heart disease. They were 
the first to identify a Type A coronary-prone 
behavior pattern, which they described as “an 
action—emotion complex that can be observed in 
any person who is aggressively involved in a 
chronic, incessant struggle to achieve more and 
more in less and less time, and if required to do so, 
against the opposing efforts of other things or per- 
sons” (Friedman & Rosenman, 1974). At the op- 
posite extreme is the Type B behavior pattern, 
characterized by an easygoing, noncompetitive, re- 
laxed lifestyle. Of course, people vary along a con- 
tinuum from “pure” Type A to “pure” Type B. 

Friedman and Ulmer (1984) have listed the spe- 
cific components of the full-fledged Type A be- 
havior pattern: 


• Insecurity of status: A hidden lack of self-esteem
  seems to plague many Type A persons. No matter
  how successful, they often compare themselves
  unfavorably to other superachievers.

• Hyperaggressiveness: A desire to dominate others
  and damage their self-esteem is part of the
  pattern. Type A persons are often indifferent to
  the feelings or rights of competitors.

• Free-floating hostility: The Type A person finds
  too many things to get upset about, and the anger
  is out of proportion to the situation.

• Sense of time urgency (hurry sickness): This includes
  two basic stratagems: speeding up daily
  activities (one Type A used an electric shaver in
  each hand!), and doing two things at once such
  as conversing on the phone while reviewing
  correspondence.


Type A behavior can be diagnosed from a short 
interview consisting of questions about habits of 
working, talking, eating, reading, and thinking 
(Friedman, 1996). The more flagrant cases of Type 
A behavior can also be detected by paper-and-pencil 
tests (Jackson & Gray, 1987; Jenkins, Zyzanski, & 
Rosenman, 1971, 1979), which we discuss in the 
next chapter. However, the questionnaire approach 
is limited because it cannot reveal the facial, vocal, 
and psychomotor indices of hostility and time ur- 
gency that are usually evident in interview (Fried- 
man & Ulmer, 1984). 

Early studies indicated that persons who exhib- 
ited the Type A behavior pattern were at greatly in- 
creased risk of coronary disease and heart attack. In 
one 9-year study of more than 3,000 healthy men, 
persons with the Type A behavior pattern were 2½
times more likely to suffer heart attacks than those 
with Type B behavior pattern (Friedman & Ulmer, 
1984). In fact, not one of the “pure” Type Bs—the 
extremely relaxed, easygoing, and noncompetitive 
members of the study—had suffered a heart attack. 
In the famous Framingham longitudinal study, 
Type A men ages 55 to 64 were about twice as 
likely at 10-year follow-up to develop coronary 
heart disease as Type B men (Haynes, Feinleib, & 
Eaker, 1983). In this study, the link between Type 
A behavior and heart disease was especially strong 
for white-collar workers. 

In more recent studies, researchers have found 
only a weak relationship—or no relationship at 
all—between Type A behavior and coronary heart 
disease (e.g., Eaker & Castelli, 1988; Matthews &
Haynes, 1986; Smedslund & Rundmo, 1999).
Other researchers have found that heart disease is 
linked not so much with the full-blown Type A be- 
havior pattern as it is with specific components 
such as being anger-prone (Dembroski, Mac- 
Dougall, Williams, & Haney, 1985) or possessing 
time urgency (Wright, 1988). Certainly, there is a 
need to sort out the specific risk factors in this area 
of investigation. In a review of current thinking, 
Wielgosz and Nolan (2000) identify hostility, cyn- 
icism, and suppression of anger, as well as stress, 
depression, and social isolation, as significant risk 
factors in Type A behavior. Good reviews of the 
complex and confusing research on Type A behav- 
ior can be found in Brannon and Feist (1992) and 
Wiebe and Smith (1997). 

Research on Type A behavior has sparked a re- 
newed and more sophisticated interest in typologi- 
cal conceptions of personality. Rather than viewing 
types as separate pigeonholes, psychologists have 
come to view them as idealized examples that occupy
the end points of continuous dimensions. Individuals
can thus differ with respect to how much
of an idealized personality type they possess.
This is similar to the trait conception of personality
discussed later. Perhaps the main difference is
that modern type theorists tend to believe that most
individuals are near the idealized types at the end
of each dimension, whereas trait theorists argue
that people are more likely to be found at all points 
along each personality continuum. In practice, 
then, the modern distinction between types and 
traits is relative, not absolute. 


PHENOMENOLOGICAL THEORIES
OF PERSONALITY


Phenomenological theories of personality emphasize
the importance of immediate, personal, subjective
experience as a determinant of behavior.
Some of the theoretical positions subsumed under
this title have been given other labels also, such as
humanistic theories, existential theories, construct
theories, self-theories, and fulfillment theories
(Maddi, 2000). Nonetheless, these approaches
share a common focus on the person’s subjective
experience, personal worldview, and self-concept 
as the major wellsprings of behavior. 


Origins of the Phenomenological Approach 


The orientation briefly reviewed in this section has 
numerous sources that reach back to turn-of- 
the-twentieth-century European philosophy and lit- 
erature. Nonetheless, two persons, one a philoso- 
pher and the other a writer, stand out as seminal 
contributors to the modern phenomenological 
viewpoint. The German philosopher Edmund 
Husserl (1859-1938) invented a complex philoso- 
phy of phenomenology that was concerned with 
the description of pure mental phenomena. 
Husserl’s approach was heavily introspective and 
nearly inscrutable. More approachable was the 
Danish writer Søren Kierkegaard (1813-1855),
well known for his contributions to existentialism. 
Existentialism is the literary and philosophical 
movement concerned with the meaning of life and 
an individual’s freedom to choose personal goals. 
The phenomenology of Husserl and the existen- 
tialism of Kierkegaard influenced dozens of promi- 
nent philosophers and psychologists. Vestiges of 
these early viewpoints are evident in virtually every 
contemporary phenomenological personality the- 
ory (Maddi, 2000). 


Carl Rogers, Self-Theory, 
and the Q-Technique 


The most influential phenomenological theorist 
was Carl Rogers (1902-1987). His contributions to 
personality theory, known as self-theory, are exten- 
sive and generally well appreciated by students of 
psychology (Rogers, 1951, 1961, 1980). But it is 
also true, albeit little recognized, that Rogers 
helped shape a small part of psychological testing 
by popularizing the Q-technique. 

The Q-technique is a procedure for studying 
changes in the self-concept, a key element in 
Rogers’s self-theory. The technique was developed 
by Stephenson (1953) but a series of studies by 
Rogers and his colleagues served to popularize this
measurement approach (Rogers & Dymond, 1954). 
Also known as a Q-Sort, the Q-technique is a 
generalized procedure that is especially useful for 
studying changes in self-concept.¹ The Q-sort
consists of a large number of cards, each contain- 
ing a printed statement such as the following: 


I am poised 

I put on a false front 

I make strong demands on myself 
I am a submissive person 

I am likeable 


The examinee is asked to sort a hundred or so state- 
ments into nine piles, putting a prescribed number 
of cards into each, thus forcing a near-normal dis- 
tribution. The instructions specify that the exami- 
nee put the cards most descriptive of him or her at 
one end, those least descriptive at the opposite end, 
and those about which he or she is indifferent or un- 
decided around the middle of the distribution. The 
required distribution might look like this: 


               Least Like Me                       Most Like Me
Pile No.         1    2    3    4    5    6    7    8    9
No. of cards     1    4   11   21   26   21   11    4    1
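The forced distribution lends itself to a mechanical check. The following is a minimal sketch, not a published scoring routine: it assumes a 100-card deck and the symmetric nine-pile counts 1, 4, 11, 21, 26, 21, 11, 4, 1, and the names `REQUIRED`, `distribution_errors`, and `example_sort` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter

# Assumed prescribed number of cards for piles 1 (least like me) to 9 (most like me).
REQUIRED = {1: 1, 2: 4, 3: 11, 4: 21, 5: 26, 6: 21, 7: 11, 8: 4, 9: 1}


def distribution_errors(placements, required=REQUIRED):
    """Return {pile: (found, required)} for every pile whose card count is wrong.

    `placements` maps each card number to the pile the examinee chose.
    An empty result means the sort obeys the forced distribution.
    """
    counts = Counter(placements.values())
    piles = set(required) | set(counts)
    return {
        pile: (counts.get(pile, 0), required.get(pile, 0))
        for pile in sorted(piles)
        if counts.get(pile, 0) != required.get(pile, 0)
    }


def example_sort(required=REQUIRED):
    """Build an arbitrary sort that satisfies the forced distribution."""
    placements = {}
    card = 1
    for pile, n_cards in required.items():
        for _ in range(n_cards):
            placements[card] = pile
            card += 1
    return placements


if __name__ == "__main__":
    sort = example_sort()
    print(distribution_errors(sort))   # no violations for a valid sort
    sort[1] = 5                        # move one card from pile 1 to pile 5
    print(distribution_errors(sort))   # piles 1 and 5 are now flagged
```

Because every examinee must use the same pile counts, all sorts share the same mean and spread, which is what makes sorts from different people or occasions directly comparable.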


The nature of the items is determined by the 
needs of the researcher or practitioner. Rogers used 
a set of items devised by Butler and Haigh (Rogers 
& Dymond, 1954, chap. 4) to tap the self-concept. 
These statements were taken at random from avail- 
able therapeutic protocols; their Q-sort items repre- 
sented actual client statements, reworded for clarity. 
But a special virtue of the Q-technique is that other 
researchers or practitioners are free to craft their 
own items. For example, Marks and Seeman (1963) 
used a psychodynamic perspective in devising
items for the therapist description of patient groups.
Examples of their items include the following:


Utilizes acting out as a defense mechanism

Tends to be flippant in both word and gesture

Genotype has paranoid features

Appears to be poised, self-assured, socially at ease

Exhibits depression (manifest sad mood)


1. The Q-technique has additional applications as well. Marks
and Seeman (1963) employed Q-sorts by therapists to describe
patients with specific MMPI profiles. Bem and Funder (1978)
recommend a Q-sort to derive a profile of characteristics associated
with successful performance of a specific task. Persons
whose self-descriptions match the derived profile can be predicted
to succeed at the selected task.


Scoring a Q-sort is usually a matter of compar- 
ing or correlating the distribution of items against 
an established norm. For example, well-adjusted 
persons might be asked to sort the items so as to de- 
rive an average pile placement number (ranging 
from 1 to 9) for each item. An individual examinee 
would be considered more- or less-adjusted ac- 
cording to the resemblance between his or her sort- 
ings and the average sorting for adjusted persons. 
We will refer the reader to Block (1961) for details. 
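The norm-based scoring just described can be sketched in a few lines. This is an illustration under stated assumptions rather than Block's (1961) actual procedure: the five-item deck, the criterion sorts, and the function names `norm_profile` and `adjustment_score` are all invented for the example, and a Pearson correlation is assumed as the measure of resemblance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def norm_profile(criterion_sorts):
    """Average pile placement (1 to 9) of each item across a criterion group."""
    n_people = len(criterion_sorts)
    n_items = len(criterion_sorts[0])
    return [sum(s[i] for s in criterion_sorts) / n_people for i in range(n_items)]


def adjustment_score(examinee_sort, criterion_sorts):
    """Resemblance between one examinee's sort and the group norm."""
    return pearson(examinee_sort, norm_profile(criterion_sorts))


if __name__ == "__main__":
    # Hypothetical pile placements of five items by three well-adjusted persons.
    well_adjusted = [[9, 7, 5, 3, 1], [8, 7, 5, 2, 2], [9, 6, 4, 3, 1]]
    similar = [9, 6, 5, 3, 1]     # sorts the items much as the norm group does
    opposite = [1, 3, 5, 6, 9]    # sorts the items in nearly the reverse order
    print(round(adjustment_score(similar, well_adjusted), 2))
    print(round(adjustment_score(opposite, well_adjusted), 2))
```

A score near 1.0 indicates that the examinee ordered the items much as the criterion group did; a score near -1.0 indicates the reverse ordering.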

Another way to use the Q-sort is to compare an 
examinee’s self-sort with his or her ideal sort. 
Rogers used the discrepancy between these two 
sortings as an index of adjustment. His subjects 
were required to sort the items twice, according to 
the following instructions: 


1. Self-sort. Sort these cards to describe yourself as 
you see yourself today, from those that are least 
like you to those that are most like you. 

2. Ideal sort. Now sort these cards to describe your
ideal person—the person you would most like 
within yourself to be (Rogers & Dymond, 1954). 


Using the item pile numbers, Rogers then correlated 
the two sorts for each subject separately. Consider 
what these data mean: If the self-sort and the ideal 
sort are highly similar, the correlation of Q-sort data 
will approach 1.0; if the two sorts are opposite one 
another, the correlation will approach -1.0. Of 
course, most sorts will be somewhere in between 
but typically on the positive side. Butler and Haigh 
found that psychotherapy clients increased their 
congruence between self and ideal (Rogers & Dy- 
mond, 1954, chap. 4). Even so, adjusted control sub- 
jects possessed a greater congruence (Table 13.2). 
TABLE 13.2 Average Self-Ideal Correlations for Client and Control Groups


                          Precounseling   Postcounseling   Follow-Up
Client Group (N = 25)         -.01             .36            .32
Control Group (N = 16)         .58                            .59


Source: Based on Rogers, C. R., & Dymond, R. F. (Eds.). (1954). Psychotherapy and personality change:
Co-ordinated research studies in the client-centered approach. Chicago: University of Chicago Press.
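The self-ideal correlation can be worked through on a toy deck. The text does not say how the correlations were computed, so the sketch below is only one reasonable reading: it assumes an ordinary Pearson correlation on pile numbers and a hypothetical nine-card deck forced into five piles (counts 1, 2, 3, 2, 1). Because both sorts obey the same forced distribution, they share one mean and one variance, and the correlation reduces to r = 1 - (sum of squared pile differences) / (2 * N * variance). The function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation computed the long way, for comparison."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def forced_sort_correlation(self_sort, ideal_sort):
    """Shortcut that is valid only when both sorts use the same forced distribution."""
    n = len(self_sort)
    mean = sum(self_sort) / n
    # Population variance of the pile numbers; identical for every forced sort.
    variance = sum((p - mean) ** 2 for p in self_sort) / n
    squared_diffs = sum((a - b) ** 2 for a, b in zip(self_sort, ideal_sort))
    return 1 - squared_diffs / (2 * n * variance)


if __name__ == "__main__":
    # Pile number given to each of nine hypothetical cards.
    self_sort = [1, 2, 2, 3, 3, 3, 4, 4, 5]
    ideal_same = list(self_sort)                    # self equals ideal
    ideal_reversed = [5, 4, 4, 3, 3, 3, 2, 2, 1]    # self opposite to ideal
    ideal_mixed = [2, 1, 3, 2, 3, 4, 3, 5, 4]       # partial agreement
    for ideal in (ideal_same, ideal_reversed, ideal_mixed):
        print(round(pearson(self_sort, ideal), 3),
              round(forced_sort_correlation(self_sort, ideal), 3))
```

The two columns printed for each pair agree, which illustrates why the discrepancy between self-sort and ideal sort and their correlation carry the same information under a forced distribution.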






BEHAVIORAL AND SOCIAL
LEARNING THEORIES


Behavioral and social learning theories have their 
origins in laboratory studies on operant learning 
and classical conditioning. A fundamental assump- 
tion of all behavioral theorists is that many of the 
behaviors that make up personality are learned. To 
understand personality, then, we must know about 
the learning history of the individual. Behavioral 
theorists also believe that the environment is of 
supreme importance in shaping and maintaining 
behavior. Behavioral inquiry therefore seeks to 
identify the specific components of the current en- 
vironment that are controlling a person’s behavior. 
The behavioral approach to personality has pro- 
duced a variety of direct assessment methods, 
which we discuss in the next chapter. 

Behavioral theorists disagree mainly on the 
role that cognitions play in determining behavior. 
Cognitions are inferred mental processes such as 
problem solving, judging, or reasoning. Radical be- 
haviorists believe that resorting to mentalistic ex- 
planations of any kind is futile: “When what a 
person does is attributed to what is going on inside 
him, investigation is brought to an end” (Skinner, 
1974). By contrast, social learning theorists make 
cautious reference to cognitions in explaining what 
it is, specifically, that a person learns. A social 
learning theorist might argue that we learn expec- 
tations or rules about the environment, not just 
stimulus and response connections. 

Modern social learning theory can be viewed as 
a cognitive variant of the strict behaviorism that 
was dominant in U.S. psychology early in the twentieth
century. Social learning theorists accept the
Skinnerian premise that external reinforcement is
an important determinant of behavior. But they also 
maintain that cognitions have a critical influence on 
our actions as well. For example, Rotter (1972) has 
popularized the view that our expectations about 
future outcomes are the primary determinants of 
behavior. The probability that a person will behave 
self-assertively, for example, depends upon his or 
her expectations about the likely results of self- 
assertiveness. If the expected outcome is valued by 
the person, the behavior is more likely. Of course, 
expectations are a function of the person’s history 
of reinforcement, so Rotter’s social learning per- 
spective is similar to the behavioral viewpoint. But 
the implication of social learning theory is that be- 
havior is the result of a belief, in particular, a belief 
that the behavior will result in a desired outcome. 
Thus, cognitions are assumed to affect actions. 

Based on his social learning views, Rotter (1966) 
developed the Internal-External (I-E) Scale, an in- 
teresting measure of internal versus external locus of 
control. The construct of locus of control refers to 
the perceptions that individuals have about the 
source of things that happen to them. In particular, 
the I-E Scale seeks to assess the examinee’s gener- 
alized expectancies for internal versus external con- 
trol of reinforcement. The purpose of the I-E Scale 
is to determine the extent to which the examinee be- 
lieves that reinforcement is contingent upon his/her 
behavior (internal locus of control) as opposed to the 
outside world (external locus of control). The in- 
strument is a forced-choice self-report inventory. For 
each item, the examinee chooses the single statement 
(from a pair) with which he/she more strongly concurs.
Items resemble the following:


In general, most people get the respect they deserve.


OR


In reality, a person’s worth often passes unrecognized.


For the preceding item, the first alternative indicates 
an internal locus of control, whereas the second al- 
ternative signifies an external locus of control. The 
balance of internal to external responses determines 
the overall score on the scale. The I-E Scale is a re- 
liable and valid instrument that has stimulated a 
huge body of research on the nature and meaning of 
locus of control and related variables. Research in- 
dicates that locus of control has a strong relation- 
ship to occupational success, physical health, 
academic achievement, and numerous other vari- 
ables. As the reader might suspect, an internal locus 
of control generally predicts a more positive out- 
come than an external locus of control. The inter- 
ested reader can consult Lefcourt (1991) and Wall, 
Hinrichsen, and Pollack (1989) for further details. 
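
The idea that a forced-choice inventory is scored from the balance of internal and external choices can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published scoring procedure for the I-E Scale: the four-item key, the "a"/"b" labels, and the function name score_ie are all hypothetical.

```python
# Hypothetical key: for each item number, the alternative ("a" or "b")
# that would count as the external choice. Not the real I-E Scale key.
EXTERNAL_KEY = {1: "b", 2: "a", 3: "b", 4: "a"}


def score_ie(responses, key=EXTERNAL_KEY):
    """Return (internal_count, external_count) for forced-choice responses."""
    internal = 0
    external = 0
    for item, choice in responses.items():
        if choice not in ("a", "b"):
            raise ValueError(f"item {item}: choice must be 'a' or 'b'")
        if choice == key[item]:
            external += 1
        else:
            internal += 1
    return internal, external


# One hypothetical examinee's choices on the four items.
responses = {1: "a", 2: "a", 3: "b", 4: "b"}
internal, external = score_ie(responses)
print(internal, external)  # 2 2
```

Because every item forces a choice between one internal and one external statement, the two counts always sum to the number of items, so either count alone conveys the balance.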

Important contributions to social learning theory
have also been made by Albert Bandura. In his early 
studies, Bandura examined the role of observational 
learning and vicarious reinforcement in the devel- 
opment of behavior (Bandura, 1965, 1971; Bandura 
& Walters, 1963). More recently, he has proposed 
that perceived self-efficacy is a central mechanism 
in human action (Bandura, 1982; Bandura, Taylor, 
Ewart, Miller, & DeBusk, 1985). Self-efficacy is a 
personal judgment of “how well one can execute 
courses of action required to deal with prospective 
situations” (Bandura, 1982). The concept of self- 
efficacy is useful in explaining why correct knowl- 
edge does not necessarily predict efficient action. For 
example, two boys may be equally convinced that a 
garden snake in the bathtub presents no hazard, but 
one will pick it up while the other runs out the door. 
These differences in behavior illustrate the role of 
self-referential thought as a mediator between knowl- 
edge and action. The boy who ran out the door did not 
believe he could deal with the situation effectively. 
He had little perceived self-efficacy for snake
handling. Bandura would argue that the primary
determinant of the boy’s behavior is a self-judgment
about personal capabilities. Cognitions are therefore
assumed to be a major determinant of behavior.

Bandura has developed an interesting instrument for
the assessment of self-efficacy expectancies (Bandura,
Taylor, Ewart, Miller, & DeBusk, 1985). For a variety
of situations that might arouse
anxiety, annoyance, or anger, the examinee checks 
whether he or she “can do” the task, and also rates 
the degree of confidence using a number from 10 
to 100. The format of the checklist is as follows: 


10    20    30    40    50    60    70    80    90    100

Quite                 Moderately                Certain
Uncertain             Certain

                                  Can Do    Confidence
Go to a party at which
there is no one you know.         ______    __________
Complain about poor food
at a restaurant.                  ______    __________








Bandura’s instrument is essentially a criterion-ref- 
erenced tool for use in psychotherapy and research. 


TRAIT CONCEPTIONS 
OF PERSONALITY 


A trait is any “relatively enduring way in which one 
individual differs from another” (Guilford, 1959). 
Psychologists developed the concept of trait from 
the ways people describe other people in everyday 
life. As language evolved, people found words to 
portray the consistencies and differences they en- 
countered in their daily interactions with others. 
Thus, when we say one person is sociable and an- 
other is shy we are using trait names to describe con- 
sistencies within individuals and also differences 
between them (Goldberg, 1981a; Fiske, 1986). 
Trait conceptions of personality have been 
enormously popular throughout the history of psy- 
chological testing, so the coverage here is neces- 
sarily selective. We will review three prominent and 
influential positions from the dozens of trait theories
that have been proposed. These approaches differ
primarily in terms of whether traits are split off into
finely discriminable variants or grouped together into
a small number of broad dimensions:

1. Cattell’s factor-analytic viewpoint identifies 16
to 20 bipolar trait dimensions.

2. Eysenck’s trait-dimensional approach coalesces
dozens of traits into two overriding dimensions.

3. Goldberg and others have sought a modern synthesis
of all trait approaches by proposing a five-factor
model of personality.


For readers who desire a more detailed discussion 
of this topic, Pervin (1993) and Wiggins (1997) 
provide an excellent review of trait approaches to 
personality theory. 


Cattell’s Factor-Analytic Trait Theory 


Cattell (1950, 1973) refined existing methods of 
factor analysis to help reveal the basic traits of per- 
sonality. He referred to the more obvious aspects of 
personality as surface traits. These would typi- 
cally emerge in the first stages of factor analysis 
when individual test items were correlated with 
each other. For example, true-false items such as “I 
enjoy a good prize fight,” “Getting stuck behind a 
slow driver really bothers me,” and “It’s important 
to let people know who is in charge” might be an- 
swered similarly by subjects, revealing a surface 
trait of aggressiveness. 

But surface traits themselves tended to come in 
clusters, as revealed by Cattell’s more sophisticated 
application of factor analysis. For Cattell, this was 
evidence of the existence of source traits, the sta- 
ble and constant sources of behavior. Source traits 
are therefore less visible than surface traits but are 
more important in accounting for behavior.

Cattell (1950) was unrivaled in his use of factor 
analysis to discover how traits were organized and 
how they were related to each other. One approach 
was to have persons rate others they knew well by 
checking various adjectives such as aggressive, 
thoughtful, and dominating from a list of 171 
choices. When the results from 208 subjects were 
subsequently factor analyzed, about 20 underlying 
personality factors or traits were tentatively identi- 
fied. Another approach was to have thousands of 
persons answer questions about themselves and 
then factor analyze their responses. Sixteen of the
original 20 personality traits were independently
confirmed by this second approach (Cattell, 1973). 
These 16 source traits have been incorporated 
into the Sixteen Personality Factor Questionnaire 
(16PF), a trait-based paper-and-pencil test of per- 
sonality that is discussed in the next chapter. 


Eysenck’s Trait-Dimensional Theory 


Eysenck used factor analysis to produce a parsimo- 
nious rapprochement between trait and dimensional 
approaches to personality (Eysenck & Eysenck, 
1975, 1985). According to his system, personality 
consists of two basic dimensions, introverted— 
extraverted and emotionally stable-emotionally un- 
stable. These two dimensions are presumed to be 
biologically and genetically based. Furthermore, 
the dimensions subsume numerous specific traits 
(Figure 13.1). The positions of the 32 traits corre- 
spond to the direction and amount of the two basic 
dimensions. For example, a moderately extraverted 
person who was also moderately unstable might be 
characterized by these traits: aggressive, excitable, 
changeable. An extremely introverted person who 
was also midway on the stable-unstable dimension 
might be viewed as unsociable, quiet, passive, and 
careful. Eysenck’s trait-dimensional theory is in- 
corporated in his personality inventory, the Eysenck 
Personality Questionnaire, which we review in the 
next chapter. 


FIGURE 13.1 (image not recovered). Axis poles: EMOTIONALLY
UNSTABLE (NEUROTIC), EMOTIONALLY STABLE, INTROVERTED. Other
labels: Phlegmatic, Sanguine; Moody, Anxious, Rigid, Sober,
Pessimistic, Reserved, Unsociable, Quiet.

Source: Reprinted with permission from Eysenck, H. J., &
Eysenck, M. W. (1985). Personality and individual differences:
A natural science approach. New York: Plenum.


The Five-Factor Model of Personality


The five-factor model of personality has its origins
in a review chapter by Goldberg (1981b). In his
analysis of factor-analytic trait research, Goldberg
identified several consistencies, which he referred
to as the “Big Five” dimensions. Although researchers
have used slightly different terms for these factors,
the most common labels are:


Neuroticism
Extraversion
Openness to Experience
Agreeableness
Conscientiousness


Rearranging the factors yields a simple acronym:
OCEAN. The five-factor model is rapidly becoming
the consensus model of personality. Support for
the five-factor approach comes from several
sources, including factor analysis of trait terms in
language and the analysis of personality from an
evolutionary perspective. We discuss these
perspectives below.

The use of trait terms in the analysis of personality
is based upon the fundamental lexical hypothesis.
The essential point of this hypothesis is that trait
terms have survived in language because they convey
important information about our dealings with others:


The variety of individual differences is nearly
boundless, yet most of these differences are
insignificant in people’s daily interactions with
others and have remained largely unnoticed. Sir
Francis Galton may have been among the first
scientists to recognize explicitly the fundamental
lexical hypothesis—namely that the most important
individual differences in human transactions will
come to be encoded as single terms in some or all
of the world’s languages. (Goldberg, 1990)


When trait terms in English are distilled down to a 
reasonably distinct and nonoverlapping set of ad- 
jectives, a few hundred characteristics typically 
emerge (Allport, 1937). For decades, researchers 
have been asking individuals to rate themselves or 
others on these or similar traits. When these ratings 
are subjected to factor analysis, the “Big Five” di- 
mensions previously listed usually appear in one 
guise or another. In sum, a mounting body of re- 
search indicates that the five-factor model captures 
a valid and useful representation of the structure of 
human traits. 

The five-factor approach also possesses evolu- 
tionary plausibility. Specifically, the five factors of 
personality previously listed capture individual 
differences that relate to such basic evolutionary
functions as survival and reproductive success
(Buss, 1997; Pervin, 1993). Goldberg (1981b) has 
theorized that people implicitly ask the following 
questions in their interactions with others: 


1. Is X active and dominant or passive and submis- 
sive? (Can I bully X or will X try to bully me?) 

2. Is X agreeable (warm and pleasant) or disagree- 
able (cold and distant)? 

3. Can I count on X? (Is X responsible and con- 
scientious or undependable and negligent?) 

4. Is X crazy (unpredictable) or sane (stable)? 

5. Is X smart or dumb? (How easy will it be for me 
to teach X?) 


Directly or indirectly, each of these evaluations has 
a bearing upon survival and reproductive success. 
For example, point 3 (conscientiousness) involves 
a trait that might ensure group survival in a hostile 
world. A person low on this trait (undependable) 
would be a poor choice for guarding the food sup- 
ply. The ability to discern conscientiousness in oth- 
ers therefore has adaptive value. Not surprisingly, 
the five points previously listed correspond to the 
five-factor personality model. 

The five-factor model of personality has inspired 
several personality scales and other systems for as- 
sessment (deRaad & Perugini, 2002). For example, 
Costa and McCrae have developed two personality 
tests based upon the five-factor model (Costa, 1991; 
McCrae & Costa, 1987). The Revised NEO Person- 
ality Inventory (NEO-PI-R) contains 240 items rated 
on a five-point scale. In addition to the five major 
domains of personality, the inventory measures six 
specific traits (called facets) within each domain. A 
shortened 60-item version known as the NEO Five- 
Factor Inventory (NEO-FFI) also is available. Trull, 
Widiger, Useda, and others (1998) have published a 
semistructured interview for the assessment of the 
five-factor model of personality. These tests are dis- 
cussed in the next chapter. 


Comment on the Trait Concept 


The challenge faced by trait theorists is that psy- 
chologists have long known that thousands of trait 
names can be found in any standard English dictionary.
For example, in an early and influential
study, Allport and Odbert (1936) tallied over 
18,000 trait names. This is obviously too many to 
be useful in any theory of personality or testing, so 
theorists are required to search for a smaller, more 
manageable number of basic traits. Until recently, 
there was no consensus whatever on the number of 
fundamental traits. Some theorists proposed two or 
three overriding trait factors, whereas others di- 
vided the personality domain into sixteen or twenty 
trait dimensions. Many personality theorists—per- 
haps a majority—now concede that the five factors 
previously noted (Neuroticism, Extraversion, 
Openness, Agreeableness, Conscientiousness) pro- 
vide a parsimonious and useful way to look at per- 
sonality. But this model is very recent, and it will 
take time to confirm its utility. For example, there 
is still debate about whether Openness to Experi- 
ence belongs on the list of fundamental dimensions 
of personality (Digman, 1990). Also, why is Intel- 
lect not included in the five-factor model? 

All trait approaches to personality share certain 
problems. First, there is disagreement
whether traits cause behavior or merely describe be- 
havior (Fiske, 1986). It can be persuasively argued 
that invoking traits as causes is an empty form of cir- 
cular reasoning. For example, a person with ex- 
tremely high standards might be said to possess the 
trait of perfectionism. But when asked to explain 
what is meant by perfectionism, we invariably end 
up referring to a pattern of extremely high standards. 
Thus, when we assert that someone is perfectionis- 
tic, are we really doing anything more than provid- 
ing a short-hand description of their past behavior? 
Miller (1991) has voiced this criticism of the five- 
factor approach, noting that the model merely de- 
scribes psychopathology but does not explain it. 

A second problem with traits is their apparently 
low predictive validity. Mischel (1968) is credited 
with the first effective disparagement of the trait 
concept in his influential book Personality and As- 
sessment. He stated that “while trait theory predicts 
behavioral consistency, it is behavior inconsistency 
that is typically observed” (Mischel, 1968). In a 
wide-ranging review of existing research, Mischel 
noted that trait scales produced validity coefficients
with an upper limit of r = .30. He coined the term
personality coefficient to describe these low
correlations. Although undoubtedly statistically
significant for large samples of subjects, correlations
of r = .30 are of minimal value in the prediction of
individual behavior.

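
A short calculation helps show why a validity coefficient of r = .30 is of limited use for individual prediction. The sketch below applies two standard statistical results, the proportion of variance explained (r squared) and the standard error of estimate expressed as a fraction of the criterion's standard deviation (the square root of 1 minus r squared). The function names are illustrative only.

```python
from math import sqrt


def variance_explained(r):
    """Proportion of criterion variance accounted for by the predictor."""
    return r ** 2


def error_ratio(r):
    """Standard error of estimate as a fraction of the criterion's SD."""
    return sqrt(1 - r ** 2)


r = 0.30
print(round(variance_explained(r), 2))  # 0.09
print(round(error_ratio(r), 3))         # 0.954
```

At r = .30 the trait score accounts for about 9 percent of the variance in the behavior, and the error in predicting any one person's behavior is still about 95 percent as large as it would be with no trait information at all.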
Trait researchers responded to Mischel’s attack 
by refining and limiting the trait concept. Re- 
searchers sought to identify subgroups of persons 
whose behavior could be accurately predicted on the 
basis of trait scores and also attempted to distinguish
the kinds of situations in which behavior is
largely determined by traits (e.g., Mischel, Shoda, 
& Mendoza-Denton, 2002; Muris, Mayer, & Mer- 
ckelbach, 1998; Wasylkiw & Fekken, 2002). These 
efforts met with modest success, raising the valid- 
ity of some trait questionnaires—in some contexts 
with some persons—substantially beyond the omi- 
nous r = .30 barrier posited by Mischel (1968). But 
gone forever are the days of simplistic, generalized 
assertions such as “trait X predicts behavior Y.” 


SUMMARY 


1. Personality is a vague construct that we 
invoke to explain behavioral consistency within 
persons and behavioral distinctiveness between 
persons. In order to appreciate the nature of per- 
sonality tests, it is helpful to review theories of per- 
sonality. 


2. Psychoanalytic theories of personality 
originated with the seminal work of Sigmund 
Freud. According to Freud’s tripartite theory of 
mind, behavior is the dynamic outcome of the 
struggle between id, ego, and superego. The id is 
completely unconscious, pleasure-oriented, and is 
the seat of all instinctual needs such as for food, 
water, sexual gratification, and avoidance of pain. 


3. Soon after birth, part of the id develops into 
the ego or conscious self. The ego is servant to the 
id but obeys the reality principle. The ego must also 
contend with the superego, the ethical component 
of personality which is modeled upon parental and 
societal standards of right and wrong. 


4. To aid in its difficult task, the ego uses defense
mechanisms, which consist of a variety of unconscious
cognitive strategies for warding off
anxiety. Defense mechanisms such as projection 
(attributing one’s faults to others) work because 
they distort reality. 


5. In a longitudinal interview study, Vaillant 
has shown that defense mechanisms tended toward 
greater maturity in middle age. Also, the use of 
mature defense mechanisms in young adulthood 
predicted better adult outcome as measured by
independent criteria such as marital stability, absence
of drug problems, and the like. 


6. Type theories attempt to sort individuals 
into discrete categories or types. For example, the 
Type A coronary-prone behavior pattern consists 
of insecurity of status, hyperaggressiveness, free- 
floating hostility, and a sense of time urgency 
(hurry sickness). Type A persons—especially those 
with anger proneness or time urgency—may be at 
increased risk of coronary disease and heart attack. 


7. Phenomenological theories of personality 
emphasize the importance of immediate, personal, 
subjective experience as a determinant of behavior. 
The phenomenological viewpoint originated with 
the German philosopher Husserl and the Danish ex- 
istential writer Kierkegaard. 


8. The most influential phenomenological 
theorist was Carl Rogers, who believed that the self 
or self-concept was central to personality. Rogers 
invented the Q-sort to measure the self-concept and 
the ideal self. In a Q-sort, the examinee sorts self- 
referential statements in nine or so piles (least like 
me to most like me). 


9. A fundamental assumption of all behavioral 
and social learning theories is that many of the be- 
haviors that comprise personality are learned. Rad- 
ical behavior theorists such as Skinner see no role 
for cognitions in explaining behavior. In contrast, 
social learning theorists such as Rotter believe that 
expectations (cognitions) about environmental
reinforcers are the primary determinants of behavior.


10. Guilford defines a trait as any relatively enduring
way in which one individual differs from another.
Trait theories evolved from the ways in
which people describe other people in everyday
life. Mischel has pointed out a major weakness of 
the trait approach: Traits possess low predictive 
validity, seldom exceeding r = .30. 

11. Cattell’s factor-analytic trait theory refers
to the more obvious aspects of personality as surface
traits, such as aggressiveness. These emerge in the
first stages of factor analysis. Source traits, which
are more important and predictive of behavior than
surface traits, are revealed by the clusterings of
surface traits. Cattell’s 16PF is based upon this model.


12. The five-factor model proposes a mod- 
ern synthesis of trait approaches in terms of five 
dimensions of personality: Neuroticism, Extraver- 
sion, Openness, Agreeableness, and Conscien- 
tiousness. Costa and McCrae have developed two 
inventories based upon this approach (NEO-PI-R 
and NEO-FFI).


The Projective Hypothesis
A Primer of Projective Techniques 


Association Techniques 
Completion Techniques 
Construction Techniques 
Expression Techniques 


Reprise: The Projective Paradox 
Case Exhibit 13.1  Projective Tests as Ancillary to the Interview 


Summary 


Key Terms and Concepts 


Frank (1939, 1948) introduced the term projective
method to describe a category of tests for
studying personality with unstructured stimuli. In a
projective test the examinee encounters vague, 
ambiguous stimuli and responds with his or her 
own constructions. Disciples of projective testing 
are heavily vested in psychoanalytic theory and its 
postulation of unconscious aspects of personality. 
These examiners believe that unstructured, vague, 
ambiguous stimuli provide the ideal circumstance 
for revelations about inner aspects of personality. 
The central assumption of projective testing is that 
responses to the test represent projections from the 
innermost unconscious mental processes of the 
examinee. We introduce this topic with some pre- 
liminary concepts and distinctions relevant to pro- 
jective testing. 


THE PROJECTIVE HYPOTHESIS


The assumption that personal interpretations of 
ambiguous stimuli must necessarily reflect the un- 
conscious needs, motives, and conflicts of the ex- 
aminee is known as the projective hypothesis. 
Frank (1939) is generally credited with popularizing
the projective hypothesis:


When we scrutinize the actual procedures that may
be called projective methods we find a wide variety
of techniques and materials being employed for the
same general purpose, to obtain from the subject,
“what he cannot or will not say,” frequently because
he does not know himself and is not aware
what he is revealing about himself through his
projections.

The challenge of projective testing is to decipher 
underlying personality processes (needs, motives, 
and conflicts) based on the individualized, unique, 
subjective responses of each examinee. In the sec- 
tions that follow we will examine how well projec- 
tive tests have met this portentous assignment. 





Origins of Projective Techniques 


Projective techniques date back to the nineteenth 
century. By way of quick review, Galton (1879) 
developed the first projective technique, a word 
association test. This procedure was adapted to 
testing by Kent and Rosanoff (1910) and used in 
therapy by C. G. Jung and others. Meanwhile, 
Ebbinghaus (1897) used a sentence completion test
as a measure of intelligence, but others soon realized
the method was better suited to personality assessment
(Payne, 1928; Tendler, 1930). Heavily
influenced by psychoanalytic formulations of per- 
sonality, Rorschach published his famous inkblot 
test in 1921. In 1905, Binet invented a precursor to 
story telling or thematic apperception techniques 
when he used verbal responses to pictures as a mea- 
sure of intelligence. These and other endeavors 
form the cornerstone of modern projective testing. 


The Popularity of Projective Tests: A Paradox 


The widespread use of projective tests has contin- 
ued unabated from the early twentieth century to 
present times (Louttit & Browne, 1947; Lubin, 
Wallis, & Paine, 1971; Watkins, Campbell, & 
McGregor, 1988). Recently, Watkins, Campbell, 
Nieberding, and Hallmark (1995) surveyed more 
than 400 psychologists who practiced assessment to 
estimate the frequency of use of various prominent 
tests. They discovered that 5 of the 15 most frequently
used tests are projective techniques (Table 13.3). 

Paradoxically, from the standpoint of tradi- 
tional psychometric criteria, projective tests do not 
fare nearly as well as the objective tests discussed 
in the next chapter. The essential puzzle of projec- 
tive tests is how to explain the enduring popularity 
of these instruments in spite of their sometimes 
questionable psychometric quality. After all, psy- 
chologists are not uniformly dense, nor are they 
dumb to issues of test quality. So why do projective 
techniques persist? We return to this puzzle— 
which might be called the projective paradox— 
after we familiarize the reader with prominent 
approaches to projective testing. 


A Classification of Projective Techniques 


Lindzey (1959) has offered a classification of pro- 
jective techniques that we will follow here. Based 
on the response required, he divided projectives 
into five categories: 


• Association to inkblots or words
• Construction of stories or sequences
• Completions of sentences or stories
• Arrangement/selection of pictures or verbal choices
• Expression with drawings or play


TABLE 13.3 The 15 Most Frequently Used Tests
in the United States

Test                                                      Rank

Wechsler Adult Intelligence Scale-Revised                    1
Minnesota Multiphasic Personality Inventory-2                2
Sentence Completion Methods*                                 3
Thematic Apperception Test*                                  4
Rorschach*                                                   5
Bender-Gestalt                                               6
Projective Drawings*                                         7
Beck Depression Inventory                                    8
Wechsler Intelligence Scale for Children-III                 9
Wide Range Achievement Test-Revised                         10
Wechsler Memory Scale-Revised                               11
Peabody Picture Vocabulary Test-Revised                     12
Millon Clinical Multiaxial Inventory-II                     13
Wechsler Preschool and Primary Scale of Intelligence-R      14
Children’s Apperception Test*                               15

*Denotes a projective test. Some examiners use the Bender as a
projective test.

Source: Adapted with permission from Watkins, C., Campbell, V.,
Nieberding, R., & Hallmark, R. (1995). Contemporary practice of
psychological assessment by clinical psychologists. Professional
Psychology: Research and Practice, 26, 54-60.


Association techniques include the widely used 
Rorschach inkblot test and its psychometrically
superior cousin the Holtzman Inkblot Test, as well 
as word association tests. Construction techniques 
include the Thematic Apperception Test and the 
many variations upon this early instrument. Com- 
pletion techniques consist mainly of sentence 
completion tests, discussed later. Arrangement/ 
selection procedures such as the Szondi test (dis- 
cussed in the first chapter) are currently seldom 
used. Finally, expression techniques such as the 
Draw-A-Person or House-Tree-Person test are very 
popular among clinicians in spite of dubious valid- 
ity data. 


We will review prominent techniques within 
each category except the antiquated arrangement/ 
selection approaches, which are almost never used. 
However, the literature on major projective tech- 
niques is simply overwhelming, running to perhaps 
tens of thousands of articles on the Rorschach 
alone. We can suggest major trends in the research, 
but the reader will need to consult other sources for 
comprehensive reviews. 


ASSOCIATION TECHNIQUES
The Rorschach 


The Rorschach consists of 10 inkblots devised by 
Hermann Rorschach (1884-1922) in the early
1900s. He formed the inkblots by dribbling ink on 
a sheet of paper and folding the paper in half, pro- 
ducing relatively symmetrical bilateral designs. 
Five of the inkblots are black or shades of gray, 
while five contain color; each is displayed on a 
white background. An inkblot of the type employed
by Rorschach is shown in Figure 13.2. The Ror- 
schach is suited to persons age five and up, but is 
most commonly used with adults. 

In administering the Rorschach, the examiner 
sits by the examinee’s side to minimize body lan- 
guage communication. Administration consists of 
two phases. In the free association phase, the ex- 
aminer presents the first blot and asks, “What might 
this be?” If the examinee asks for clarification (e.g., 
“Should I use the whole blot or only part of it?”), 
the examiner always responds in a nondirective 
manner (“It’s up to you”). The test proceeds at a 
leisurely pace, so there is an implicit expectation 
that the examinee will give more than one response 
per card. However, this is not required; it is even 
permissible for the examinee to reject a card en- 
tirely, although this rarely happens. All 10 cards are 
presented in a similar manner. 

Next, the examiner begins the inquiry phase. In 
this phase the examiner asks questions to clarify the 
exact blot location of each percept and to determine 
which aspects of the blot, such as the form or color, 
played a part in the creation of the response. Based 
on the information collected during the inquiry 
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FIGURE 13.2 An Inkblot Similar to Those Found on 
the Rorschach 


phase, the examiner can then code the location, de- 
terminants, form quality, and content of each re- 
sponse according to one or more formal scoring 
systems. For example, if the examinee used the en- 
tire blot for a percept, the response is coded W 
(whole); if the form of the blot was important in the 
percept, the response is further coded F (form); if 
human movement is depicted in the percept, the re- 
sponse is coded M (movement); the use of color in 
a percept is coded C (color), CF (color/form), or FC 
(form/color), depending upon whether form is to-
tally absent, secondary, or primary relative to color
as a determinant. The content of the percept is also coded,
for example, H (human), Hd (human detail), An 
(anatomy), Cg (clothing), and so on (Table 13.4). 
Proper scoring of the Rorschach requires extensive 
training and supervision; we have touched on just 
a few basic aspects here. 
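
As a rough illustration of the coding just described, the following Python sketch records each response as a location code, one or more determinant codes, and a content code, and then tallies the codes across a protocol. The class name, field names, and the two sample responses are hypothetical, and the code sets include only the examples mentioned in the text; actual scoring systems are far more elaborate.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative subsets of the codes mentioned in the text (not a full system).
LOCATIONS = {"W"}
DETERMINANTS = {"F", "M", "C", "CF", "FC"}
CONTENTS = {"H", "Hd", "An", "Cg"}


@dataclass(frozen=True)
class CodedResponse:
    """One coded response; the field names are hypothetical."""
    card: int
    location: str
    determinants: tuple
    content: str

    def __post_init__(self):
        # Reject codes outside the illustrative sets above.
        if self.location not in LOCATIONS:
            raise ValueError(f"unknown location code: {self.location}")
        unknown = set(self.determinants) - DETERMINANTS
        if unknown:
            raise ValueError(f"unknown determinant codes: {sorted(unknown)}")
        if self.content not in CONTENTS:
            raise ValueError(f"unknown content code: {self.content}")


def tally(protocol):
    """Count how often each code occurs, separately for each coding category."""
    counts = {"location": Counter(), "determinant": Counter(), "content": Counter()}
    for response in protocol:
        counts["location"][response.location] += 1
        counts["determinant"].update(response.determinants)
        counts["content"][response.content] += 1
    return counts


# Two invented responses, used only to exercise the sketch.
protocol = [
    CodedResponse(card=1, location="W", determinants=("F", "M"), content="H"),
    CodedResponse(card=2, location="W", determinants=("CF",), content="An"),
]
print(tally(protocol))
```

Tallies of this kind are the raw material for the summary scores that an examiner computes once the entire protocol has been coded.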

Regrettably, Rorschach died before he could 
complete his scoring methods, so the systematiza- 
tion of Rorschach scoring was left to his followers. 
Five American psychologists produced overlapping 
but independent approaches to the test: Samuel
Beck, Marguerite Hertz, Bruno Klopfer, Zygmunt
Piotrowski, and David Rapaport (Erdberg, 1985).
Predictably, the nuances of scoring vary from one
scoring method to another. Fortunately, Exner and
his colleagues have synthesized these earlier ap-
proaches into the Comprehensive Scoring System
(Exner, 1991, 1993; Exner & Weiner, 1994). The
Comprehensive Scoring System is better grounded
in empirical research and clearly has supplanted all
other approaches to Rorschach scoring.

Once the entire protocol has been coded, the ex-
aminer can compute a number of summary scores
that form the primary basis for hypothesizing about
the personality of the examinee. For example, the
F+ percent is the proportion of the total responses
that uses pure form as a determinant. A voluminous
literature exists on the meaning of this index, but it
seems safe to hypothesize that when the F+ per-
centage falls below 70 percent, the examiner should
consider the possibility of severe psychopathology,
brain impairment, or intellectual deficit in the ex-
aminee (Exner, 1993). The F+ percent is also con-
sidered to be an index of ego strength, with higher
scores indicating a greater capacity to deal effec-
tively with stress. However, support for this con-
jecture is mixed at best.

TABLE 13.4 Summary of Major Rorschach Scoring Criteria

I. Location: Where on the blot was the percept located?
   W    Whole            Entire inkblot used
   D    Common detail    Well-defined part used
   Dd   Unusual detail   Unusual part used
   S    Space            Percept defined by white space

II. Determinant: What feature of the blot determined the response?
   F    Form             Shape or outline used
   F+   Form+            Excellent match of percept and inkblot
   F-   Form-            Very poor match of percept and inkblot
   M    Movement         Movement seen or implied in percept
   C    Color            Color helped determine the response
   T    Texture          Shading involved in the response

III. Content: What was the percept?
   H    Human            Percept of a whole human form
   Hd   Human detail     Human form incomplete in any way
   Ex   Explosion        An actual explosion
   Xy   X-ray            X-ray of any human part; involves shading

IV. Popular versus Original
   P    Popular          Response given by many normal persons
   O    Original         Rare and creative response

Note: This table represents a consensus of all the major scoring systems. The list is incomplete and illus-
trative only. Full scoring systems are very complex and allow for blends, e.g., FM, CF, WS-, Do. For ex-
amples, see Exner, J. E., Jr. (1993). The Rorschach: A comprehensive system, Volume 1. Basic foundations
(3rd ed.). New York: Wiley.

Frank (1990) has emphasized that formal scor- 
ing of the Rorschach is insufficient for some pur- 
poses such as the diagnosis of schizophrenia. He 
stresses that an analysis of the patient’s thinking for 
the presence of highly personal, illogical, and 
bizarre associations to the blots is essential for psy- 
chodiagnosis. In his approach, the Rorschach is re- 
ally an adjunct to the interview, and not a test per se. 


Comment on the Rorschach 


For a variety of reasons, it is difficult to offer con- 
cise generalizations about the reliability, validity, 
and clinical utility of the Rorschach. Even simple 
questions provoke complex answers. For example,
what is the purpose of a Rorschach evaluation? In
successive research epochs, the Rorschach has 
been used to derive a psychiatric diagnosis, esti- 
mate prognosis for psychotherapy, obtain an index 
of primary process thinking, predict suicide, and 
formulate complex personality structures, to name 
just a few applications (Peterson, 1978). The pur- 
pose of the Rorschach is so ill-defined that some 
adherents even decline to regard it as a test, prefer- 
ring instead to call it a method for generating in- 
formation about personality functioning (Weiner, 
1994). When the purpose of an instrument is un- 
clear, objective research on its psychometric attrib- 
utes is both risky and difficult. Worse yet, objective 
research may be pointless since supporters will ig- 
nore contrary findings and detractors don’t use the 
test anyway. 

A study by Albert, Fox, and Kahn (1980) on the 
susceptibility of the Rorschach to faking is typical of 
research on this instrument. We remind the reader 
that thousands of research studies exist in the litera- 
ture, including many with positive, supportive find- 
ings (e.g., Hilsenroth, Fowler, Padawer, & Handler, 
1997; Smith, Gacono, & Kaufman, 1997; Weiner, 
1996). But the mixed results reported by Albert, Fox, 
and Kahn (1980) are not unusual. They submitted 
the Rorschach protocols of 24 persons to a panel of 
experts, asking for psychiatric diagnoses of each ex- 
aminee. The 24 Rorschach protocols consisted of re- 
sults from four groups of six persons each: 


• Mental hospital patients with a diagnosis of paranoid schizophrenia
• Uninformed fakers given instructions to fake the responses of a paranoid schizophrenic
• Informed fakers who listened to a detailed audiotape about paranoid schizophrenia
• Normal controls who took the test under standard instructions


The uninformed fakers, informed fakers, and nor- 
mal controls were students who had passed an 
MMPI screening and were judged reasonably nor- 
mal during interview. Each protocol was rated by 
six to nine judges, all fellows of the Society for 
Personality Assessment. The judges were told to 
provide a psychiatric diagnosis as well as other in-
formation not reported here. The judges were not
informed as to the purpose of the study, but were
told to assess whether any profiles appeared to be
malingered.

The informed fakers must have done an excel- 
lent job, for they were more likely to be diagnosed 
psychotic than the real patients themselves (72 
percent versus 48 percent, respectively). The un- 
informed fakers were fairly convincing, too, with 
a 46 percent rate of diagnosed psychosis. The nor- 
mal controls were diagnosed as psychotic 24 per- 
cent of the time. Granted that the diagnostic 
challenge in this study was immense, it is still dis- 
turbing to find that the expert judges rated 24 per- 
cent of the normal protocols as psychotic, while 
correctly identifying psychosis in only 48 percent 
of the actual psychotic protocols. A more recent 
study by Netter and Viglione (1994) also con- 
cluded that the Rorschach was susceptible to the 
faking of psychosis. 
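
The percentages reported above can be restated in the language of diagnostic accuracy. The short Python sketch below treats the actual patients as true cases and the normal controls as non-cases; the likelihood ratio it computes is our own summary of the reported figures, not a statistic taken from Albert, Fox, and Kahn (1980), and the variable names are illustrative.

```python
# Percent of protocols diagnosed as psychotic in each group, as reported in the text.
diagnosed_psychotic = {
    "informed fakers": 72,
    "actual patients": 48,
    "uninformed fakers": 46,
    "normal controls": 24,
}

# Treat patients as true cases and normal controls as non-cases (an assumption).
hit_rate = diagnosed_psychotic["actual patients"]         # psychosis correctly identified
false_alarm_rate = diagnosed_psychotic["normal controls"]  # normals labeled psychotic
miss_rate = 100 - hit_rate                                 # patients not labeled psychotic

# Positive likelihood ratio: how much more often a "psychotic" diagnosis
# was given to a real patient than to a normal control.
likelihood_ratio = hit_rate / false_alarm_rate

print(f"hit rate: {hit_rate}%, miss rate: {miss_rate}%, false alarms: {false_alarm_rate}%")
print(f"positive likelihood ratio: {likelihood_ratio:.1f}")
```

On these figures, a diagnosis of psychosis was only twice as likely for an actual patient as for a normal control, which helps explain why the results are described as disturbing.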

Although there are noteworthy exceptions in 
Rorschach testing, a substantial number of studies 
point to low reliability and a general lack of pre- 
dictive validity (Carlson, Kula, & St. Laurent, 
1997; Peterson, 1978; Lanyon, 1984; Wood, 
Nezworski, & Stejskal, 1996; Lilienfeld, Wood, & 
Garb, 2000). In a meta-analytic review, Garb, 
Florio, and Grove (1998) concluded that the 
Rorschach explained a dismal 8 to 13 percent of the 
variance in client characteristics, as compared to 
the MMPI, which explained 23 to 30 percent of the 
variance. On the positive side, recent studies based 
upon improvements in scoring offered by the Exner 
approach are more optimistic in outcome (see 
Exner, 1995; Exner & Andronikof-Sanglade, 1992; 
Meyer, 1997; Ornberg & Zalewski, 1994; Piotrow- 
ski, 1996). Even so, the Rorschach has not yet 
gained the status of scientific respectability enjoyed 
by many other personality tests, and perhaps it 
never will. 
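
To put the variance figures from the Garb, Florio, and Grove (1998) review in more familiar terms, the sketch below converts each percentage of variance explained into the size of the corresponding correlation. It assumes that variance explained can be read as a squared correlation coefficient; that reading is our assumption for illustration, and the function name is hypothetical.

```python
import math


def correlation_from_variance(percent_explained):
    """Return the correlation whose square equals the given percent of variance."""
    if not 0 <= percent_explained <= 100:
        raise ValueError("percent_explained must be between 0 and 100")
    return math.sqrt(percent_explained / 100)


# Ranges of variance explained, as reported in the text.
ranges = {"Rorschach": (8, 13), "MMPI": (23, 30)}

for test_name, (low, high) in ranges.items():
    r_low = correlation_from_variance(low)
    r_high = correlation_from_variance(high)
    print(f"{test_name}: r from about {r_low:.2f} to {r_high:.2f}")
```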


Holtzman Inkblot Technique 


Wayne H. Holtzman sought to overcome the major 
limitations in the Rorschach by developing a com- 
pletely new technique using more inkblots with 
simplified procedures for administration and scor-
ing. In the Holtzman Inkblot Technique (HIT), the exam-
inee is limited to one response per card, but views
a series of 45 cards. Each response is followed with 
a very simple twofold question: Where was the per- 
cept represented in the blot, and what about the blot 
suggested the percept? 

The HIT comes in two carefully constructed 
parallel forms. The existence of parallel forms is in- 
valuable for test-retest studies, since examinees 
often remember their responses to a card and there- 
fore mechanically offer the same answer when 
retested. The 45 responses to the HIT are scored for 
22 different variables derived from early Rorschach
scoring systems. The HIT scoring variables are de-
scribed in Table 13.5.

The scoring system for the HIT is highly reli- 
able, and the standardization of the instrument ap- 
pears to be adequate. When well-trained scorers are 
used, interscorer agreement is .95 to 1.00 for most
categories; only Penetration and Integration fall
below these standards.
Split-half reliabilities are also acceptable, with me- 
dian values in the .70s and .80s. Test-retest stabil- 
ity with parallel forms is generally fair, although 
some categories (Location, with r of .81) perform 
better than others (Popular, with r of .36). Per-
centile norms for each scoring category are re-
ported separately for college students (N = 206),
average adults (N = 252), seventh graders (N =
197), elementary schoolchildren (N = 132), five-
year-olds (N = 122), chronic schizophrenics (N =
140), depressed patients (N = 90), and persons with
mental retardation (N = 100).

TABLE 13.5 Names and Descriptions of the Holtzman Inkblot Technique Variables

Reaction Time               Time in seconds from the presentation of the inkblot to the
                            beginning of the primary response.
Rejection                   Subject fails to report anything or returns the inkblot to the
                            examiner.
Location                    Scored on a 3-point system: 0 = whole blot, 1 = large area,
                            2 = smaller area.
Space                       Scored when there is a true figure-ground reversal; the white
                            part is the figure.
Form Definiteness           Scored on a 5-point system from 0 (formless concept, e.g.,
                            paint splatter) to 4 (highly formed concept, e.g., centaur).
Form Appropriateness        Goodness of fit of the concept to the form of the inkblot;
                            0 = poor, 1 = fair, 2 = good.
Color                       Color is a primary determinant, usually mentioned by the
                            subject; scored 0 to 3.
Shading                     Subject refers to shading (fuzziness, texture) as a determinant;
                            scored 0 to 2.
Movement                    Scored when the response implies energy or dynamic movement
                            quality; scored 0 to 4.
Pathognomic Verbalization   Incoherent, queer, absurd, self-referential, etc.,
                            verbalizations to cards.
Integration                 Scored 1 if two or more blot elements are effectively
                            integrated in the response; otherwise scored 0.
Content Scores              Each category (Human, Animal, Anatomy, Sex, Abstract) is scored
                            0 to 2 based on absence, partial, or full presence of the concept.
Anxiety                     Each response is scored 0 to 2 for signs of anxiety (e.g., dark
                            and dangerous cave).
Hostility                   Each response is scored 0 to 3 for signs of hostility (e.g.,
                            mangled butterfly).
Barrier                     Barrier refers to any protective covering, membrane, shell, or
                            skin that might be symbolically related to body-image
                            boundaries; 1 if present, 0 if absent.
Penetration                 Scored 1 if the concept is symbolic of an examinee’s feeling
                            that his or her body exterior can be easily penetrated;
                            otherwise 0.
Balance                     Scored 1 if examinee refers to presence or absence of symmetry
                            in the design; otherwise scored 0.
Popular                     Scored 1 if the response is common, observed in 1 of 7
                            normative protocols.

Source: Based on Holtzman, W. H. (1961). Guide to administration and scoring: Holtzman Inkblot
Technique. New York: The Psychological Corporation.

The validity of the HIT has been addressed in 
several hundred research studies reporting on the re- 
lationships between HIT scores and independent 
measures of personality (Hill, 1972; Holtzman, 
1988; Swartz, Reinehr, & Holtzman, 1983). In gen- 
eral, the relationships are modest but supportive of 
HIT validity, especially as an aid to psychodiag- 
nosis. Holtzman (2000, 2002) describes the cross- 
cultural applications of the HIT and notes that the 
test has been featured in more than 800 publications. 

A recent variant of the HIT requires two re- 
sponses to each of a carefully selected subset of 25 
cards from Form A. Called the HIT 25 to distin- 
guish it from the standard HIT, this new test holds 
exceptional promise for helping make the diagno- 
sis of schizophrenia. Using completely objective 
scoring criteria and simple decision rules, the HIT 
25 correctly classified 26 of 30 schizophrenics and 
28 of 30 normal college students (Holtzman, 1988). 
The decision criteria consist of four rules for nor-
mal findings scored +1 each, and 13 rules for schiz-
ophrenic findings scored −1 each. The total results
are summed algebraically, yielding the “normalcy” 
score. This score is the basis for simple diagnostic 
decisions. Scores above zero suggest normalcy, 
whereas scores below zero indicate schizophrenia; 
a score of zero is indeterminate. The HIT 25 looks
promising but cross-validation studies would be es- 
pecially welcome. 
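
The decision rule for the HIT 25 is simple enough to express in a few lines of Python. In the sketch below the inputs are merely counts of how many of the 4 normal rules and 13 schizophrenic rules a protocol satisfies; the rules themselves are not given in the text, and the function names are hypothetical.

```python
NORMAL_RULES = 4          # each satisfied rule contributes +1
SCHIZOPHRENIC_RULES = 13  # each satisfied rule contributes -1


def normalcy_score(normal_rules_met, schizophrenic_rules_met):
    """Sum the rule outcomes algebraically to obtain the normalcy score."""
    if not 0 <= normal_rules_met <= NORMAL_RULES:
        raise ValueError("normal_rules_met must be between 0 and 4")
    if not 0 <= schizophrenic_rules_met <= SCHIZOPHRENIC_RULES:
        raise ValueError("schizophrenic_rules_met must be between 0 and 13")
    return normal_rules_met - schizophrenic_rules_met


def classify(score):
    """Apply the decision rule: above zero, below zero, or exactly zero."""
    if score > 0:
        return "normal"
    if score < 0:
        return "schizophrenia"
    return "indeterminate"


# Overall hit rate implied by the reported results (Holtzman, 1988):
# 26 of 30 schizophrenics and 28 of 30 college students correctly classified.
overall_hit_rate = (26 + 28) / (30 + 30)

print(classify(normalcy_score(3, 1)))   # a protocol meeting 3 normal and 1 schizophrenic rule
print(f"overall hit rate: {overall_hit_rate:.0%}")
```

Because the rule was derived and evaluated on the same kind of sample, the overall hit rate computed here is best regarded as an upper bound until cross-validation data are available.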


COMPLETION TECHNIQUES
Sentence Completion Tests 


In a sentence completion test, the respondent is pre- 
sented with a series of stems consisting of the first 
few words of a sentence, and the task is to provide 
an ending. As with any projective technique, the 
examiner assumes that the completed sentences 
reflect the underlying motivations, attitudes, con- 
flicts, and fears of the respondent. Usually, sen- 
tence completion tests can be interpreted in two 
different ways: subjective-intuitive analysis of the 
underlying motivations projected in the subject’s 
responses, or objective analysis by means of scores 
assigned to each completed sentence. 

An example of a sentence completion test is 
shown in Figure 13.3. This test is quite similar to 
existing instruments in that the stems are very short 
and restricted to a small number of basic themes. 
The reader will notice that three topics recur in
this short test (the respondent’s self-concept, 
mother, and father). In this manner the examinee 
has multiple opportunities to reveal underlying mo- 
tivations about each topic. Of course, most sentence 
completion tests are much longer—anywhere from 
40 to 100 stems—and contain more themes—any- 
where from 4 to 15 topics. 





Directions: Finish these sentences to indicate how you feel.

1. My best characteristic is
2. My mother
3. My father
4. My greatest fear is
5. The best thing about my mother was
6. The best thing about my father was
7. I am proudest about
8. I only wish my mother had
9. I only wish my father had

FIGURE 13.3 Example of a Short Sentence Completion Test

Dozens of sentence completion tests have been
developed; most are unpublished and unstandard- 
ized instruments produced to meet a specific clini- 
cal need. Some representative sentence completion 
tests in current use are outlined in Table 13.6. 
Of these instruments, Loevinger’s Washington 
University Sentence Completion Test is the most 
sophisticated and theory-bound (e.g., Weiss, 
Zilberg, & Genevro, 1989). However, the Rotter 
Incomplete Sentences Blank has the strongest em- 
pirical underpinnings and is the most widely used 
in clinical settings. We examine this instrument in 
more detail. 


TABLE 13.6 Brief Outline of Representative Sentence Completion Tests

Sentence Completion Series
Psychological Assessment Resources
The SCS consists of 50 sentence stems designed to aid the clinician in identifying
underlying concerns and specific areas of client distress. A unique feature of this
instrument is the publication of eight different forms, parallel in content, which
allow for repeated testing.

Forer Structured Sentence Completion Test
Western Psychological Services
This instrument is available in separate forms for men, women, adolescent boys, and
adolescent girls. Each form contains 100 sentence stems designed to cover
attitude-value systems, evasiveness, and defense mechanisms.

Geriatric Sentence Completion Form
Psychological Assessment Resources
The GSCF is a 30-item form specifically developed for use with older adult clients.
The GSCF elicits personal responses to four content domains: physical, psychological,
social, and temporal orientation. The test manual includes a number of clinical case
illustrations.

Washington University Sentence Completion Test
Privately published by Loevinger
The WUSC uses separate forms for men, women, and younger male and female subjects.
This test is highly theory-bound; responses are classified according to seven stages
of ego development: presocial and symbiotic, impulsive, self-protective, conformist,
conscientious, autonomous, integrated.





Rotter Incomplete Sentences Blank 


The Rotter Incomplete Sentences Blank (RISB) 
consists of three similar forms—high school, col- 
lege, and adult—each containing 40 sentence stems 
written mostly in the first person (Rotter & Raf- 
ferty, 1950; Rotter, Lah, & Rafferty, 1992). Al- 
though the test can be subjectively interpreted in 
the usual manner through qualitative analysis of 
needs projected in the subject’s responses, it is the 
objective and quantitative scoring of the RISB that 
has drawn the most attention. 

In the objective scoring system each completed 
sentence receives an adjustment score from 0 (good 
adjustment) to 6 (very poor adjustment). These 
scores are based initially on the categorizing of 
each response as follows: 


• Omission: no response or response too short to be meaningful
• Conflict response: indicative of hostility or unhappiness
• Positive response: indicative of positive or hopeful attitude
• Neutral response: declarative statement with neither positive nor negative affect


Examples of the last three categories include: 


I hate . . . the entire world. (conflict response) 
The best . . . is yet to come. (positive response) 
Most girls . . . are women. (neutral response) 


Conflict responses are scored 4, 5, or 6, from 
lowest to highest degree of the conflict expressed. 
Positive responses are scored 2, 1, or 0, from least 
to most positive response. Neutral responses and 
omissions receive no score. The manual gives ex- 
amples of each scoring category. The overall ad- 
justment score is obtained by adding the weighted 
ratings in the conflict and positive categories. The 
adjustment score can vary from 0 to 240, with 
higher scores indicating greater maladjustment. 
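
The scoring scheme as described here can be sketched in Python. The sketch follows the description in this text only (it is not drawn from the RISB manual), and the category labels, the 1-to-3 "degree" argument, and the function names are hypothetical conveniences.

```python
# Scores by degree (1 = lowest, 3 = highest), following the description in the text.
CONFLICT_SCORES = {1: 4, 2: 5, 3: 6}   # lowest to highest degree of conflict
POSITIVE_SCORES = {1: 2, 2: 1, 3: 0}   # least to most positive response
UNSCORED = {"neutral", "omission"}      # these receive no score


def item_score(category, degree=None):
    """Return the adjustment score for one completed sentence, or None if unscored."""
    if category in UNSCORED:
        return None
    if degree not in (1, 2, 3):
        raise ValueError("degree must be 1, 2, or 3 for scored categories")
    if category == "conflict":
        return CONFLICT_SCORES[degree]
    if category == "positive":
        return POSITIVE_SCORES[degree]
    raise ValueError(f"unknown category: {category}")


def adjustment_score(items):
    """Add the ratings in the conflict and positive categories."""
    scores = (item_score(category, degree) for category, degree in items)
    return sum(score for score in scores if score is not None)


# Five invented ratings, used only to exercise the sketch.
sample = [("conflict", 3), ("positive", 3), ("neutral", None),
          ("positive", 1), ("omission", None)]
print(adjustment_score(sample))
```

With 40 stems and a maximum of 6 points per stem, the highest possible total is 40 × 6 = 240, which matches the upper limit of the range given above.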

The reliability of the adjustment score is ex- 
ceptionally good, even when derived by assistants 
with minimal psychological expertise. Typically, 
interscorer reliabilities are in the .90s and split-half 
coefficients are in the .80s (Rotter et al., 1992; Rot-
ter, Rafferty, & Schachtitz, 1965). The validity of
this index has been investigated in numerous stud- 
ies using the RISB as a screening device with a 
“maladjustment” cutoff score. For example, a cut- 
off score of 135 has been found to correctly screen 
delinquent youths 60 percent of the time while 
identifying nondelinquent youths correctly 73 per- 
cent of the time (Fuller, Parmelee, & Carroll, 
1982). The same cutoff identifies heavy drug users 
80 to 100 percent of the time (Gardner, 1967). 
These and similar findings support the construct va- 
lidity of the adjustment index, but also indicate that 
classification rates are much lower than needed for 
individual decision making or effective screening. 
It also appears that the norms for the adjustment 
index are outdated. Lah and Rotter (1981) found 
that current student scores differ significantly from 
those obtained in the original study by Rotter and 
Rafferty (1950). Lah (1989) and Rotter et al. (1992) 
provide new normative, scoring, and validity data 
for the RISB. 
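
A brief calculation shows why classification rates of this size fall short for screening. The sketch below uses the rates reported for the cutoff score of 135 (60 percent of delinquent youths and 73 percent of nondelinquent youths correctly identified) together with a purely hypothetical base rate of 10 percent delinquency; the base rate and the function name are our assumptions, not figures from the studies cited.

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Proportion of persons flagged by the cutoff who truly belong to the target group."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)


sensitivity = 0.60   # delinquent youths correctly screened (reported in the text)
specificity = 0.73   # nondelinquent youths correctly identified (reported in the text)
base_rate = 0.10     # hypothetical proportion of delinquent youths in the screened group

ppv = positive_predictive_value(sensitivity, specificity, base_rate)
print(f"Of those scoring above the cutoff, about {ppv:.0%} would be delinquent.")
```

Under this assumed base rate, roughly four of every five youths flagged by the cutoff would be misclassified, even though the cutoff performs better than chance.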

As discussed by P. Goldberg (1965), the sim- 
plicity of the single adjustment score is both the 
test’s strength and weakness. True, the test provides 
a quick and efficient method for obtaining an over- 
all index of how respondents are functioning on a 
day-to-day basis. However, a single score cannot
possibly capture any nuances of personality func- 
tioning. In addition, the RISB is subject to the same 
types of bias as other self-report measures, namely, 
the information will reflect mainly what the respon- 
dent wants the examiner to know (Phares, 1985). 


Rosenzweig Picture Frustration Study 


Often considered a semiprojective technique, the 
Rosenzweig Picture Frustration Study (P-F Study) 
requires the examinee to produce a verbal response 
to highly structured verbal-pictorial stimuli. The P- 
F Study comes in three forms—child, adolescent, 
and adult—each consisting of 24 comic-strip pic- 
tures depicting a frustrating circumstance (Rosen- 
zweig, 1977, 1978a). Each picture contains two 
people, with the person on the left uttering words 
that provoke or describe a frustrating situation to the 
person on the right (Figure 13.4). The examinee is
requested to indicate, by writing in the balloon
above the frustrated person’s head, the first verbal
response that comes to mind as being uttered by the
anonymous cartoon figure. In the case of younger
examinees, the examiner writes down the subject’s
response.

[Balloon text in the sample item: “This is a fine time to have lost the keys.”]

FIGURE 13.4 Sample Item from the Rosenzweig Picture-Frustration Study
Copyright © by Saul Rosenzweig. Reproduced by permission.

The purpose of the P-F Study is to assess the 
examinee’s characteristic manner of reacting to 
frustration. Frustration is defined as occurring 
whenever the organism encounters an obstacle or 
obstruction en route to the satisfaction of a need 
(Rosenzweig, 1944). In a general sense, it is well 
known that persons react to frustration with ag- 
gression. The value of the P-F Study is its multi- 
faceted conceptualization of aggression according 
to three directions and three types. The direction of 
aggression can be extraggressive, in which it is turned onto
the environment; intraggressive, in which it is turned by the
examinee onto the self; or imaggressive, in which it is evaded
in an attempt to gloss over the frustration. The type
of aggression can be obstacle-dominant, in which 
the barrier that occasions the frustration stands out 
in the response; ego-defensive, in which the orga- 
nizing capacity of the examinee predominates in 
the response; or need-persistent, in which the solu-
tion of the frustrating situation is emphasized by
pursuing the goal despite the obstacle (Rosen- 
zweig, 1978b). It is important to point out that ag- 
gression is not necessarily a negative construct. 
Need-persistent types of aggression represent con- 
structive, sometimes creative, forms of aggression 
while ego-defensive aggression is frequently de- 
structive (of others or oneself). 

The P-F Study is scored by detecting one or two 
of the factors in each individual response. Deep in- 
terpretations are avoided; the manual contains scor- 
ing samples to aid in decision making. When the 
item scores have been tallied, the scoring blank is 
completed by computing the percentages of the 
nine scoring categories which occur in the protocol 
of the examinee. The overall types and directions of 
aggression are also tallied, resulting in 15 indices. 
In addition, a Group Conformity Rating (GCR) can 
be computed. The GCR indicates how closely the 
examinee’s responses correspond to those given 
most frequently by a norm sample. All the indices 
can be compared to results from appropriate stan- 
dardization samples. Of course, in addition to quan- 
titative scoring, responses to the P-F Study can be 
evaluated impressionistically. 
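
The arithmetic behind the 15 indices can be sketched as follows. Crossing the three directions with the three types yields the nine scoring categories; adding the three direction totals and three type totals gives 15. The sketch expresses every index as a percentage of all scored factors, which is an assumption on our part (the manual may tally the totals differently), and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import product

DIRECTIONS = ("extraggressive", "intraggressive", "imaggressive")
TYPES = ("obstacle-dominant", "ego-defensive", "need-persistent")


def pf_indices(scored_factors):
    """Compute 9 category percentages plus 3 direction and 3 type totals.

    scored_factors is a list of (direction, type) pairs, one pair per factor
    detected in a response (each response contributes one or two factors).
    """
    if not scored_factors:
        raise ValueError("at least one scored factor is required")
    for direction, kind in scored_factors:
        if direction not in DIRECTIONS or kind not in TYPES:
            raise ValueError(f"unknown factor: {(direction, kind)}")

    counts = Counter(scored_factors)
    total = len(scored_factors)
    indices = {}
    # Nine scoring categories: every direction crossed with every type.
    for direction, kind in product(DIRECTIONS, TYPES):
        indices[(direction, kind)] = 100.0 * counts[(direction, kind)] / total
    # Overall totals for each direction and each type of aggression.
    for direction in DIRECTIONS:
        indices[direction] = sum(indices[(direction, kind)] for kind in TYPES)
    for kind in TYPES:
        indices[kind] = sum(indices[(direction, kind)] for direction in DIRECTIONS)
    return indices


# Four invented factors, used only to exercise the sketch.
factors = [
    ("extraggressive", "ego-defensive"),
    ("extraggressive", "ego-defensive"),
    ("intraggressive", "need-persistent"),
    ("imaggressive", "obstacle-dominant"),
]
indices = pf_indices(factors)
print(len(indices), indices["extraggressive"], indices["need-persistent"])
```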

The interscorer reliability of the P-F Study is re- 
portedly in the range of .80 to .85 for well-trained, 
conscientious examiners. However, the test-retest 
stability of the instrument is somewhere between 
fair and marginal. For example, retest correlations 
for scoring categories on the adult form of the P-F 
Study range from .21 to .71, with most values in the 
.40s (Rosenzweig, 1978b). A huge body of valida- 
tional research has been summarized in several pub- 
lications (Rosenzweig, 1977, 1978b; Rosenzweig & 
Adelman, 1977). Based on the very modest relia- 
bilities of the scoring categories, we concur that the 
P-F Study is more appropriate for research than in- 
dividual assessment (Graybill & Heuvelman, 1993). 


CONSTRUCTION TECHNIQUES
The Thematic Apperception Test (TAT) 


The TAT consists of 30 pictures that portray a vari-
ety of subject matters and themes in black-and-
white drawings and photographs; one card is blank.
Most of the cards depict one or more persons en- 
gaged in ambiguous activities. Some cards are used 
for adult males (M), adult females (F), boys (B), or 
girls (G), or some combination (e.g., BM). As a 
consequence, exactly 20 cards are appropriate for 
every examinee. 

A picture similar to those on the TAT is shown 
in Figure 13.5. In administering the TAT, the 
examiner requests the examinee to make up a 
dramatic story for each picture, telling what led up to 
the current scene, what is happening at the moment, 
how the characters are thinking and feeling, and 
what the outcome will be. The examiner writes down 
the story verbatim for later scoring and analysis. 

The TAT was developed by Henry Murray and 
his colleagues at the Harvard Psychological Clinic 
(Morgan & Murray, 1935; Murray, 1938). The test 
was originally designed to assess constructs such 
as needs and press, elements central to Murray’s 
personality theory. According to Murray, needs or- 
ganize perception, thought, and action and energize 
behavior in the direction of their satisfaction. Ex-
amples of needs include the needs for achievement,
affiliation, and dominance. In contrast, press refers
to the power of environmental events to influence
a person. Alpha press refers to objective or “real” external
forces, whereas beta press concerns the subjective
or perceived components of external forces. Mur-
ray (1938, 1943) developed an elaborate TAT scor-
ing system for measuring 36 different needs and
various aspects of press, as revealed by the exami-
nee’s stories.

FIGURE 13.5 A Picture Similar to Those on the Thematic Apperception Test

Almost as soon as Murray released the TAT, 
other clinicians began to develop alternative 
scoring systems (e.g., Dana, 1959; Eron, 1950; 
Shneidman, 1951; Tomkins, 1947). Literature on 
the administration, scoring, and interpretation of 
the TAT burgeoned extensively, as documented by 
recent reviews (Aiken, 1989, chap. 12; Groth- 
Marnat, 1997; Ryan, 1987; Weiner & Kuehnle, 
1998). By the 1950s, there was no single preferred 
mode of administration, no single preferred system 
of scoring, and no single preferred method of in- 
terpretation, a predicament that still endures today. 
Clinicians even vary the wording of the instructions 
and commonly select an individualized subset of 
TAT cards for each client. Indeed, the absence of 
standardized procedures is such that we should 
rightly regard the TAT as a method, not a test. 

It is worth mentioning that Murray’s instruc-
tions included a statement that the TAT was “a test
of imagination, one form of intelligence” and fur-
ther stipulated:


I am going to show you some pictures, one at a 
time; and your task will be to make up as dramatic a 
story as you can for each. Tell what has led up to 
the event shown in the picture, describe what is 
happening at the moment, what the characters are 
feeling and thinking; and then give the outcome. 
Speak your thoughts as they come to your mind. Do 
you understand? Since you have fifty minutes for 
ten pictures, you can devote about five minutes to 
each story. Here is the first picture. (Murray, 1943) 


Currently, clinicians downplay the emphasis upon 
imagination and intelligence when giving instruc- 
tions. Surely, this omission must influence the qual- 
ity of the stories produced. 

Even though more than a dozen scoring sys- 
tems have been proposed, interpretation of the TAT 
is usually based upon a clinical-qualitative analysis
of the story productions. A central consideration 
harks back to Murray’s “hero” assumption. Ac- 
cording to this viewpoint, the hero is the protago- 
nist of the examinee’s story. It is assumed that the 
examinee clearly identifies with this character and 
projects his or her own needs, strivings, and feel- 
ings onto the hero. Conversely, thoughts, feelings, 
or actions avoided by the hero may represent areas 
of conflict for the examinee. A specific example 
will help clarify these points. Consider the response 
to Card 3BM given by a depressed examinee¹:


Looks like... I can’t tell if it’s a girl or boy. Could 
be either. I guess it doesn’t matter. This person just 
had a hard physical workout. I guess it’s a her. 
She’s just tired. No trauma happened or anything. 
She was sitting around a table with friends and she 
got real tired. She’s not in a health danger or any- 
thing. These are her keys. Her friends drag her 
back to her room and put her to bed. She’s O.K. the 
next day. No trauma. She’s tired physically, not 
mentally. (Ryan, 1987) 


What stands out in this response is the repetitive de-
nial of danger or trauma. But later in the testing, the
denial of trauma is no longer maintained. Read how 
the examinee responded to the blank card, relating 
a story of a young man, traumatized at school, who 
takes his car down to the river: 


He sees the bridge, he’s really down. He remem-
bers that he’s heard stories about people jumping
off and killing themselves. He could never under-
stand why they did that. Now he understands, he
jumps and dies . . . he should have waited ’cause
things always get better sometime. But he didn’t
wait, he died. (Ryan, 1987)


Most clinicians would conclude that the examinee 
who produced these stories had been traumatized 
and was defending against self-destructive im- 
pulses. Correspondingly, the clinician would be well 
advised to explore these issues in psychotherapy. 

1. Card 3BM depicts one person, arguably male or female,
kneeling or slumped over on a couch with head bowed on one
arm. In the corner is a vaguely drawn object interpreted by some
examinees to be a handgun or other weapon.

The psychometric adequacy of the TAT is diffi-
cult to evaluate because of the abundance of scoring
and interpretation methods. Clinicians defend the
test on an anecdotal basis, pointing out remarkable 
and confirmatory findings such as illustrated here. 
However, data-minded researchers are more cau- 
tious. One problem is that formally scored TAT pro- 
tocols possess very low test-retest reliability, with 
a reported median value of r = .28 (Winter & Stew-
art, 1977). Furthermore, an astonishing 97 percent
of test users employ subjective and “personalized” 
procedures for interpreting the TAT; that is, only a 
tiny fraction of clinical practitioners rely upon a 
standardized scoring system (Lilienfeld, Wood, & 
Garb, 2001). This is troubling because a consistent 
theme in research on projective testing is that intu- 
itive interpretations are likely to overdiagnose psy- 
chological disturbance. 
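
A test-retest coefficient like the one cited above is the Pearson correlation between scores from two administrations of the same test to the same examinees. The following is a minimal sketch of that computation. The function name and the scores are hypothetical, made up for illustration; they are not data from Winter and Stewart (1977).

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    var_x = sum((x - mean_x) ** 2 for x in first)
    var_y = sum((y - mean_y) ** 2 for y in second)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary within each administration")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six examinees tested on two occasions.
time_1 = [4, 7, 2, 9, 5, 6]
time_2 = [6, 5, 4, 6, 8, 3]
print(round(pearson_r(time_1, time_2), 2))
```

A coefficient near 1.0 would mean that examinees keep roughly the same rank order across occasions; a value as low as .28 means that the first score says little about the second.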

In large measure, then, the interpretation of the 
TAT is based on strategies with unknown and 
untested reliability and validity. Even so, advocates 
of the test remain undaunted, proposing that prac- 
titioners with psychodynamic expertise can use the 
instrument as a “magic set of optics without which 
psychologists have only partial psychological vi- 
sion... a means of inferring the vital secret wishes 
and unconscious fantasies that participants are not 
able to communicate directly” (Schneidman, 1999, 
p. 87). Obviously, a large chasm separates enthusi- 
astic clinical practitioners from skeptical empirical 
researchers in their assessment of the TAT. The lat- 
ter group has made an occasional effort to develop 
new TAT scoring approaches that might provide a 
solid empirical foundation for the test (McGrew & 
Teglasi, 1990; Ronan, Colavito, & Hammontree, 
1993). However, there is surprisingly little ongoing 
research on TAT scoring systems. 


The Picture Projective Test 


The Picture Projective Test (PPT) is a long overdue 
attempt to construct a general-purpose instrument 
with improved psychometric qualities (Ritzler, 
Sharkey, & Chudy, 1980; Sharkey & Ritzler, 1985). 
The developers of the PPT note that the majority of 
the TAT pictures exert a strong negative stimulus 
“pull” on storytelling. The TAT cards are cast in 
dark, shaded tones and most scenes portray persons
in low-key or gloomy situations. It is not surprising,
then, that projective responses to the TAT are 
strongly channeled toward negative, melancholic 
stories (Goldfried & Zax, 1965). 

In contrast, the PPT uses a new set of pic- 
tures taken from the Family of Man photo essay 
published by the Museum of Modern Art (1955). 
The following criteria were used in selecting 30 
pictures: 


The pictures had to show promise of eliciting 
meaningful projective material. 

Most but not all of the pictures had to include 
more than one human character. 

About half of the pictures had to depict humans 
showing positive affective expression (e.g., smil- 
ing, embracing, dancing). 

About half of the pictures had to depict humans 
in active poses, not simply standing, sitting, or 
lying down. 


In an initial pilot study, the authors compared 
TAT and PPT story productions of eight under- 
graduates on several variables such as length of sto- 
ries, emotional tone, and activity level (Ritzler, 
Sharkey, & Chudy, 1980). Compared to the TAT 
productions, the PPT stories were of comparable 
length but were much more positive in thematic 
content and emotional tone. The PPT stories were 
also much more active, meaning that the central 
character had an active, self-determined effect on 
the situation in the story. Furthermore, the PPT sto- 
ries placed greater emphasis upon interpersonal 
rather than intrapersonal themes. In other words, 
the PPT stories placed more emphasis on “healthy,” 
adaptive aspects of personality adjustment than did 
the TAT productions. 

The PPT developers also compared their in- 
strument against the TAT in a diagnostic validity 
study (Sharkey & Ritzler, 1985). PPT and TAT 
story productions of 50 subjects were compared: 
normals, nonhospitalized depressives, hospitalized 
depressives, hospitalized psychotics with good pre- 
morbid histories, and hospitalized psychotics with 
poor premorbid histories (10 subjects in each 
group). Although the TAT and PPT were essentially 
equal in their capacity to discriminate normal from
depressed subjects, the PPT was superior in differentiating
psychotics from normals and depressives.
On the PPT, depressives told stories with gloomier
emotional tone and psychotics made more perceptual
distortions and thematic/interpretive deviations.
The PPT appears to be a very promising
instrument, although it is obvious that further re- 
search is needed on its psychometric qualities. One 
noteworthy feature is that anyone can purchase the 
PPT stimuli at their local bookstore. The requisite 
materials are found in the Family of Man photo col- 
lection (Museum of Modern Art, 1955). 


CHILDREN’S APPERCEPTION TEST 


Designed as a direct extension of the TAT, the Chil- 
dren’s Apperception Test (CAT) consists of 10 pic- 
tures and is suitable for children 3 to 10 years of 
age. The preferred version for younger children 
(CAT-A) depicts animals in unmistakably human 
social settings (Bellak & Bellak, 1991). The test de- 
velopers used animal drawings on the assumption 
that young children would identify better with ani- 
mals than humans. A human figure version (CAT- 
H) is available for older children (Bellak & Bellak, 
1994). No formal scoring system exists for the CAT 
and no statistical information is provided on relia- 
bility or validity. Instead, the examiner prepares a 
diagnosis or personality description based upon a 
synthesis of 10 variables recorded for each story: 
(1) main theme; (2) main hero; (3) main needs and 
drives of hero; (4) conception of environment (or 
world); (5) perception of parental, contemporary, 
and junior figures; (6) conflicts; (7) anxieties; 
(8) defenses; (9) adequacy of superego; (10) inte- 
gration of ego (including originality of story and 
nature of outcome) (Bellak, 1992). The lack of at- 
tention to psychometric issues of scoring, reliabil- 
ity, and validity of the CAT is troublesome to most 
testing specialists. 


Other Variations on the TAT 


The TAT has inspired a number of similar tests designed
for children and older adults (Table 13.7).
In addition, modifications and variations of the TAT
have been developed for ethnic, racial, and linguistic
minorities. One of the first was the Thompson
TAT (T-TAT) in which 21 of the original TAT pictures
were redrawn with African American figures
(Thompson, 1949). This TAT modification incorporated
certain unintended changes, for example,
in facial expressions and the situations portrayed.
As a result, the T-TAT should be considered a new
test and not just a TAT translation suited to African
American individuals (Aiken, 1989).


TABLE 13.7 Thematic Apperception Tests for
Specific Populations


Adolescent Apperception Cards

This is the only thematic apperception test designed
specifically for adolescents (12- to 19-year-olds). The
11 cards represent contemporary issues relevant to adolescents;
themes include loneliness, parenting styles,
domestic violence, gang activity, and drug abuse (Silverton,
1993). Problems with this instrument include
the negative themes depicted in the cards (which preclude
positive associations) and the absence of any objective
approach to scoring. Like many thematic
apperception techniques, the AAC is really an idiographic
clinical tool, not a test.


Blacky Pictures

For children ages 5 and older, the Blacky Pictures test
was also based on the premise that children identify
more readily with animals than humans. The 11 cartoon
stimuli depict the adventures of the dog Blacky and his
family (Mama, Papa, and sibling Tippy). In addition to
requesting a story for each card, the examiner also presents
multiple-choice questions based on stages of psychosexual
development derived from psychoanalytic
theory (Blum, 1950). Although the test was originally
developed with adults, children enjoy taking the Blacky
and are quite responsive to the pictures. Problems with
this test include the absence of norms, especially for
children, and poor stability of scores (LaVoie, 1987).


Michigan Picture Test-Revised

For older children ages 8 to 14 years, the MPT-R consists
of 15 pictures and a blank card. Responses are
scored for Tension Index (e.g., portrayal of personal adequacy),
Direction of Force (whether the central figure
acts or is acted upon), and Verb Tense (e.g., past, present,
future). These three scores can be combined to
yield a Maladjustment Index. Reliability and norms are
adequate, although evidence of validity is unsatisfactory.
A major problem with this test is that the cards
portray interpersonal relationships so vividly that little
is left to the child’s imagination (Aiken, 1989).


Senior Apperception Test (SAT)

Although the 16 situations depicted on the SAT cards
include some positive circumstances, the majority of
pictures were designed to reflect themes of helplessness,
abandonment, disability, family problems, loneliness,
dependence, and low self-esteem (Bellak, 1992).
Critics complain that the SAT stereotypes the elderly
and therefore discourages active responding (Schaie,
1978; Klopfer & Taulbee, 1976).

Another specialized TAT-like test is the 
TEMAS, which consists of 23 colorful drawings 
that depict Hispanic persons interacting in con- 
temporary, inner-city settings (Aiken, 1989; Con- 
stantino, Malgady, & Rogler, 1988). TEMAS is 
Spanish for themes and an acronym for “tell me a 
story.” The thematic content of TEMAS stories is 
scored for 18 cognitive functions, 9 personality 
(ego) functions, and 7 affective functions. The test 
can also be scored for various objective indices 
such as reaction time, fluency, unanswered in- 
quiries, and stimulus transformations (e.g., a letter 
is transformed into a bomb). Hispanic children re- 
spond well to the TEMAS, even though they may 
be inarticulate in response to traditional projective 
tests. 

The inconsistent reliability of the TEMAS is a 
source of concern, because reliability constrains 
validity. The manual reports that Cronbach’s alpha 
for the 34 scoring functions ranged from .31 to .98 
with half below .70. Test-retest reliabilities were 
even lower; the highest correlation was r = .53 and 
for 26 of the 34 functions the correlations were 
near zero! In spite of the questionable reliability of 
the instrument, several studies provide support for 
its concurrent and predictive validity. For example, 
in a clinical sample of 210 Puerto Rican children, 
TEMAS scale scores predicted independent cri- 
teria of ego development, trait anxiety, and adap- 
tive behavior reasonably well, with correlations 
ranging from .27 to .51 (Malgady, Constantino, 
& Rogler, 1984). A steady stream of research 
has continued to bolster the utility of this instru- 
ment, as surveyed by Constantino & Malgady 
(1996). Flanagan and di Guiseppe (1999) provide
a critical review of the TEMAS; Constantino and
Malgady (2000) describe recent developments 
with the test. 
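
The statement that reliability constrains validity can be made concrete with a standard result from classical test theory: an observed validity coefficient cannot exceed the square root of the product of the test's reliability and the criterion's reliability. The sketch below applies that bound to the lowest and highest alpha values reported above. The function name is hypothetical, and a perfectly reliable criterion is assumed by default purely for illustration.

```python
from math import sqrt

def validity_ceiling(test_reliability, criterion_reliability=1.0):
    """Upper bound on an observed validity coefficient (classical test theory).

    criterion_reliability defaults to 1.0, an optimistic assumption
    made here only to keep the illustration simple.
    """
    for value in (test_reliability, criterion_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return sqrt(test_reliability * criterion_reliability)

# Lowest and highest Cronbach's alpha reported for the TEMAS scoring functions.
for alpha in (0.31, 0.98):
    print(alpha, round(validity_ceiling(alpha), 2))
```

Under these assumptions, a function with alpha of .31 could not correlate above roughly .56 with any criterion, whereas a function with alpha of .98 is barely constrained.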


EXPRESSION TECHNIQUES


The Draw-A-Person Test


As the reader will recall from an earlier chapter, 
Goodenough (1926) used the Draw-A-Man task as 
a basis for estimating intelligence. Subsequently, 
psychodynamically minded psychologists adapted 
the procedure to the projective assessment of 
personality. Karen Machover (1949, 1951) was 
the pioneer in this new field. Her procedure be- 
came known as the Draw-A-Person Test (DAP). 
Her test enjoyed early popularity and is still widely 
used as a clinical assessment tool. Watkins, Camp- 
bell, Nieberding, and Hallmark (1995) report that 
projective drawings such as the DAP rank eighth 
in popularity among clinicians in the United 
States. 

The DAP is administered by presenting the ex- 
aminee with a blank sheet of paper and a pencil 
with eraser, then asking the examinee to “draw a 
person.” When the drawing is completed the exam- 
inee usually is directed to draw another person of 
the sex opposite that of the first figure. Finally, the 
examinee is asked to “make up a story about this 
person as if he [or she] were a character in a novel 
or a play” (Machover, 1949). 

Interpretation of the DAP proceeds in an en- 
tirely clinical-intuitive manner, guided by a number 
of tentative psychodynamically based hypotheses 
(Machover, 1949, 1951). For example, Machover 
maintained that examinees were likely to project 
acceptable impulses onto the same-sex figure and 
unacceptable impulses onto the opposite-sex fig- 
ure. She also believed that the relative sizes of the 
male and female figures revealed clues about the 
sexual identification of the examinee. Several of 
Machover’s interpretive hypotheses are listed in 
Table 13.8. 

TABLE 13.8 Illustrative Interpretations of the Draw-A-Person Test

Sign                              Hypothesized Interpretive Significance

Disproportionately large head     Organic brain disease; previous brain surgery;
                                  preoccupation with headaches

Deliberate omission of            Evasive about highly conflictual interpersonal
facial features                   relationships

Mouth drawn with heavy            Verbally aggressive, over-critical, and sometimes
line slash                        sadistic personality

Chin changed, erased,             Compensation for weakness, indecision, and fear
or reinforced                     of responsibility

Large male eyes with lashes       Homosexually inclined male, often very extraverted

Hair emphasis, e.g., a beard      An indication of a striving for virility

Graphic emphasis of the neck      Disturbed about the lack of control over impulses

Conspicuous treatment of          Preoccupation with masturbation
index finger, thumb

Anatomical indications of         Found only in schizophrenic or actively manic patients
internal organs

Source: Based on Machover, K. (1949). Personality projection in the drawing of the human figure.
Springfield, IL: Charles C. Thomas.


These interpretive premises are colorful, interesting,
and plausible. However, they are based
entirely upon psychodynamic theory and anecdotal
observations. Machover made little effort to vali- 
date the interpretations. The empirical support for 
her hypotheses is somewhere between meager and 
nonexistent (Swensen, 1957, 1968). In favor of the 
DAP, the overall quality of drawings does weakly 
predict psychological adjustment (Lewinsohn, 
1965; Yama, 1990). However, judged by contem- 
porary standards of evidence, the sweeping and 
cavalier assessments of personality so often derived 
from the DAP are embarrassing. Some reviewers 
have concluded that the DAP is an unworthy test 
that should no longer be used (Gresham, 1993; 
Motta, Little, & Tobin, 1993). 

Rather than using the DAP to infer nuances of 
personality, a more appropriate application of this 
test is in the screening of children suspected of be- 
havior disorder and emotional disturbance. For this 
purpose, Naglieri, McNeish, and Bardos (1991) 
developed the Draw A Person: Screening Proce- 
dure for Emotional Disturbance (DAP:SPED). In 
one study, diagnostic accuracy of problem children 
was significantly improved by application of the
DAP:SPED scoring approach (Naglieri & Pfeiffer,
1992). 


The House-Tree-Person Test (H-T-P) 


The H-T-P is a projective test that uses freehand 
drawings of a house, tree, and person (Buck, 1948, 
1981). The examinee is given almost complete 
freedom in sketching the three objects; separate 
pencil and crayon drawings are requested. Al- 
though the examiner can improvise an H-T-P Test 
with mere blank pieces of paper, Buck (1981) rec- 
ommends the use of a four-page drawing form with 
identification information on the first page. Pages 
two, three, and four are titled House, Tree, and Per- 
son. Two drawing forms are needed for each ex- 
aminee, one for pencil drawings and the other for 
crayon drawings. Buck (1981) also provides a sep- 
arate four-page form for a postdrawing interroga- 
tion phase, which consists of 60 questions designed 
to elicit the examinee’s opinions about elements of 
the drawings. Many practitioners feel the postdrawing
interrogation phase is not worth the
extended effort. Also, the value of separate crayon
drawings is questioned (Killian, 1987). 

The House-Tree-Person Test has much the same 
familial lineage as the Draw-A-Person Test. Like 
the DAP Test, the H-T-P Test was originally con- 
ceived as a measure of intelligence, complete with 
a quantitative scoring system to appraise an ap- 
proximate level of ability (Buck, 1948). However, 
clinicians soon abandoned the use of the H-T-P as a 
measure of intelligence, and it is now used almost 
exclusively as a projective measure of personality. 

Although we will not delve into any details
here, the interpretation of the H-T-P rests upon 
three general assumptions: the House drawing mir- 
rors the examinee’s home life and intrafamilial re- 
lationships; the Tree drawing reflects the manner in 
which the examinee experiences the environment; 
and the Person drawing echoes the examinee’s 
interpersonal relationships. Buck (1981) provides 
numerous interpretive hypotheses for both quanti- 
tative and qualitative aspects of the three drawings. 

The H-T-P is an alluring test that has fascinated 
clinicians for more than 40 years. Unfortunately, 
Buck (1948, 1981) has never provided any evi- 
dence to support the reliability or validity of this in- 
strument. Indeed, he is perhaps his own worst critic. 
At one point in his test manual, he even asserts 
that validational research is not possible with the 
H-T-P (Buck, 1981, p. 164). Among the impedi- 
ments to such research, he cites the following points: 


1. No single sign itself is an infallible indication of 
any strength or weakness in the S. 

2. No H-T-P sign has but one meaning. 

3. The significance of a sign may differ markedly 
from one constellation to another. 

4. The amount of diagnostic and prognostic data 
derivable from each of the points of analysis 
may vary greatly from S to S. 

5. Colors do not have any absolute and universal
meaning. 

6. Nothing in the quantitative scoring system can be 
taken automatically at face value (Buck, 1981). 


In general, attempts to validate the H-T-P as a 
personality measure have failed miserably (for reviews
see Ellis, 1970; Hayworth, 1970; Krugman,
1970; Killian, 1987). Thoughtful reviewers have 
repeatedly recommended the abandonment of the 
H-T-P and similar figure-drawing approaches to 
personality assessment. But these pronouncements 
apparently fall on deaf ears. The popularity of the 
H-T-P and other projective techniques continues 
unabated. In the final section of this chapter, we 
offer some reflections on the continued acceptance 
of projective techniques. 





REPRISE: THE PROJECTIVE PARADOX

The evidence is quite clear that personality infer- 
ences drawn from projective tests often are wrong. 
In the face of negative validational findings, the en- 
during practitioner acceptance of these tests con- 
stitutes what we have referred to as the projective 
paradox. How do we explain the continued popu- 
larity of instruments for which the validity evidence 
is at best mixed, often marginal, occasionally 
nonexistent, or even decisively negative? 

We offer two explanations for the projective 
paradox. The first is that human beings cling to pre- 
existing stereotypes even when exposed to contra- 
dictory findings. Decades ago, Chapman and 
Chapman (1967) demonstrated this phenomenon 
with projective tests, naming it illusory validation. 
These researchers asked college students to observe 
several human figure drawings similar to those ob- 
tained from the Draw-A-Person Test (DAP). The stu- 
dents were naive with respect to projective tests and 
knew nothing about traditional DAP interpretive hy- 
potheses. Each drawing was accompanied by brief 
descriptions of two symptoms which supposedly 
characterized the patient who produced the drawing. 
Actually, the symptoms were assigned randomly to 
drawings and consisted of the bits and pieces of DAP 
clinical lore that had been gleaned from an earlier 
mail questionnaire to clinical psychologists. For example,
two of the symptoms used were these:

1. Is worried about how manly he is
2. Is suspicious of other people

Each student received a different combination of 
drawings and randomly assigned symptoms. 

Later, the students were asked to demonstrate 
what they had learned by describing, for several 
drawings, the symptoms they had observed to be 
linked with that kind of drawing. Of course, in re- 
ality there was no learning to be demonstrated, 
since symptoms and drawings were randomly com- 
bined. Nonetheless, the participants responded in 
terms of popular clinical stereotypes (e.g., unusual 
eyes indicate suspiciousness, large head suggests a 
concern with intelligence). Apparently, the com- 
monsense stereotypes held by participants emerged 
robust and unscathed—in spite of an abundance of 
disconfirming examples. Perhaps something simi- 
lar occurs in all fields of projective testing: clini- 
cians notice the confirming instances, but ignore 
the more numerous findings which contradict
expectations. 
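
The logic of this design can be illustrated with a small simulation: when a symptom is assigned to drawings at random, the true association between any drawing feature and that symptom is close to zero, so any link an observer "sees" must come from prior expectation. The sketch below is not a reconstruction of the Chapman and Chapman procedure. The number of drawings, the 50 percent base rates, and all names are assumptions chosen for illustration.

```python
import random
from math import sqrt

def phi_coefficient(pairs):
    """Phi (correlation) between two binary variables given as (feature, symptom) pairs."""
    a = sum(1 for f, s in pairs if f and s)          # feature present, symptom present
    b = sum(1 for f, s in pairs if f and not s)      # feature present, symptom absent
    c = sum(1 for f, s in pairs if not f and s)      # feature absent, symptom present
    d = sum(1 for f, s in pairs if not f and not s)  # both absent
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def simulate(n_drawings=10_000, seed=0):
    """Pair a drawing feature with an independently assigned symptom and return phi."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_drawings):
        unusual_eyes = rng.random() < 0.5  # hypothetical drawing feature
        suspicious = rng.random() < 0.5    # symptom assigned without regard to the drawing
        pairs.append((unusual_eyes, suspicious))
    return phi_coefficient(pairs)

print(round(simulate(), 3))
```

The printed coefficient should hover near zero, which is the statistical reality the participants faced even as they reported the stereotyped associations.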

The second explanation for the projective
paradox is that many clinicians do not use projective 
methods as tests at all, but as auxiliary approaches 
to the clinical interview. These practitioners use pro- 
jective techniques as clinical tools to derive tentative 
hypotheses about the examinee. Most of these hy- 
potheses will turn out to be false when examined 
more closely. However, the few that are confirmed 
may have important implications for the clinical 
management of the examinee. Furthermore, we 
suspect that these fruitful hypotheses might not 
emerge—or might emerge more slowly—if the 
practitioner relied entirely upon the interview or 
used only formal tests with established reliability 
and validity (Case Exhibit 13.1). However, this as- 
sertion is difficult to test empirically. We remain 
open to the possibility that clinically successful ap- 
plications of projective techniques largely provide 
further evidence of illusory validation.


SUMMARY


1. Projective tests are based upon the projec- 
tive hypothesis: Personal interpretations of ambigu- 
ous stimuli must necessarily reflect the unconscious 
needs, motives, and conflicts of the examinee. 
Popular projectives include the Rorschach inkblot 
test, the Thematic Apperception Test, sentence 
completion tests, and drawing tests (e.g., Draw-A- 
Person). 


2. The Rorschach, released in 1921, consists 
of 10 roughly symmetrical inkblots. For each card, 
the examiner asks “What might this be?” In the in- 
quiry phase, the examiner clarifies which aspects 
of the blot (e.g., form or color) played a part in the 
creation of each response. 


3. The preferred Rorschach scoring method 
by Exner codes each response for location, form, 
human movement, the use of color, content, and 
other variables. Summary scores and ratios of vari- 
ables provide hypotheses about personality func- 
tioning. In spite of its enduring popularity, the 
Rorschach is still haunted by questions about reli- 
ability and validity. 

4. The Holtzman Inkblot test consists of 45 
cards; a single response to each is required. Scor- 
ing categories for the HIT are highly reliable, with 
interscorer agreement generally in the .90s. Valid- 
ity studies using simple decision rules support the 
use of the HIT as an aid to psychodiagnosis, espe- 
cially in schizophrenia. 

5. The Rotter Incomplete Sentences Blank 
(RISB) contains 40 sentence stems written mostly in 
the first person. Each completed sentence receives 
an adjustment score from 0 (good) to 6 (poor); the 
sum is the overall adjustment score. Correct classi- 
fication rates (e.g., adjusted versus maladjusted) are 
too low for individual decision making. 


6. The Rosenzweig Picture Frustration Study 
(P-F Study) consists of 24 drawings, each showing 
two persons in a highly frustrating circumstance. 
The examinee provides the first verbal response that 
comes to mind. Objective ratings indicate typical 
modes of reacting to frustration. Owing to its low re- 
liability, the P-F Study is suited mainly to research. 


7. The Thematic Apperception Test (TAT) 
consists of 30 black-and-white drawings and pho- 
tographs; one card is blank. The examinee is asked 
to make up a dramatic story for each picture, includ- 
ing past, present, future, and feelings of the main 
characters. TAT interpretation usually rests upon 
clinical-qualitative analysis of story productions. 


8. Variations on the TAT include the Picture 
Projective Test, based upon photographs from the 
Family of Man photo essay; the Thompson TAT for 
African Americans; the Children’s Apperception 
Test (CAT) which utilizes drawings of animals; 
TEMAS, an apperception test designed for Hispanic 
persons; and apperception tests for elderly citizens. 


9. In Machover’s Draw-A-Person (DAP), the 
examinee is asked simply to “draw a person.” Interpretation
proceeds in a clinical-intuitive manner
based upon published hypotheses, for example, a re- 
drawn chin indicates indecision. Another test in a 
similar vein is the House-Tree-Person (the exami- 
nee draws these) for which validity evidence is also 
meager. 


10. The projective paradox (enduring popular- 
ity of projective tests in spite of questionable valid- 
ity) can be explained, in part, by the phenomenon of 
illusory validation. Illusory validation is demon- 
strated when subjects ignore disconfirming in- 
stances and cling to their preexisting stereotypes. 


Structured Personality Assessment


Self-Report Inventories


The history of personality assessment can be
characterized by two overlapping trends.
First, unstructured projective techniques such as the
Rorschach test dominated personality testing in the
early twentieth century and then waned in popularity.
Second, structured approaches such as self-report
inventories and behavioral ratings gained
prominence in midcentury and then rapidly expanded
in popularity. In the previous topic we introduced
the reader to the many varieties of
projective techniques. These methods are resplendent
in the richness of the hypotheses they yield;
however, projective techniques largely lack the
approval of psychometrically oriented clinicians.
In this chapter, we focus on the more objective
methods for personality assessment favored by
measurement-minded psychologists. In Topic 14A,
Self-Report Inventories, we review true-false and
forced-choice instruments, including the most
widely used personality test ever, the Minnesota
Multiphasic Personality Inventory (MMPI), and its
recent revision, the MMPI-2. In Topic 14B, Behavioral
Assessment and Related Approaches, we
examine more recent approaches that rely upon behavioral
observations and ratings.

Contemporary psychometricians have relied 
upon three tactics for test development: theory- 
bounded approaches, factor-analytic strategies, and 
criterion-key methods. We will organize the dis- 
cussion of self-report inventories around these 
three categories. Of course, the boundaries are 
somewhat artificial and many test developers use a 
combination of methods. 

The structured approaches to personality test- 
ing discussed in the following sections are steeped 
in the details of psychometric methodology. These
tests feature prominent references to reliability indices,
criterion keying, factor analysis, construct
validation, and other forms of technical craftsman- 
ship. For this reason, the approaches discussed here 
are often considered objective—as contrasted with 
projective. However, whether they are objective in 
any meaningful sense is really an empirical ques- 
tion that must be answered on the basis of research. 
Perhaps it is more accurate to call these methods 
structured. They are structured in the sense that 
highly specific rules are followed in the adminis- 
tration, scoring, and interpretation of the tests. In 
fact, some of the approaches are so completely 
structured that an examinee can answer questions 
presented on a computer screen and observe a computer-generated
narrative report spewed forth from
the printer seconds later.¹


THEORY-GUIDED INVENTORIES


The construction of several self-report inventories
was guided closely by formal or informal theories
of personality. In these cases, the test developer de- 
signed the instrument around a preexisting theory. 
Theory-guided inventories stand in contrast to fac- 
tor-analytic approaches which often produce a ret- 
rospective theory based upon initial test findings. 
Theory-guided inventories also differ from the 
stark atheoretical empiricism found in criterion-key 
instruments such as the MMPI and MMPI-2. Ex- 
amples of theory-guided inventories include the 
Edwards Personal Preference Schedule (EPPS) and
the Personality Research Form (PRF), both based 
on Murray’s (1938) need-press theory of personal- 
ity. Further examples include the Myers-Briggs 
Type Indicator (MBTI), which represents an appli- 
cation of Carl Jung’s theory of personality types. 
The Jenkins Activity Survey, designed to assess the 
Type A coronary-prone behavior pattern, also epit- 
omizes a theory-guided instrument. Finally, some 
theory-guided inventories such as the State-Trait 
Anxiety Inventory (STAI) attempt to measure very
specific components of personality. Below, we
review each of these tests in more detail.


1. Computerized narrative reports may not be altogether a positive
development. We discuss the benefits and pitfalls of computer-generated
reports in the next chapter.


Edwards Personal Preference Schedule 


The Edwards Personal Preference Schedule 
(EPPS) was the first attempt to measure Murray’s 
(1938) manifest needs with a structured personal- 
ity inventory (Edwards, 1959; Helms, 1983). The 
reader will recall from an earlier discussion that 
Murray posited 15 needs and developed a projec- 
tive test, the Thematic Apperception Test, to tap 
those needs. Edwards, a consummate psychometrician
well versed in the nuances of measurement
theory, sought to develop an objective, structured 
test to measure those 15 needs in a more reliable 
and valid manner. The 15 needs are listed next to 
an EPPS profile in Figure 14.1. 

The EPPS consists of 210 pairs of statements in 
which items from each of the 15 scales are paired 
with items from the other 14. The inventory uses a 
forced-choice format in which the examinee must 
choose the one statement from each pair that is 
most personally representative. The forced-choice 
format of the EPPS is peculiar and uncomfortable 
to most test takers, because it often serves up the 
proverbial choice between a rock and a hard place. 
Here are three EPPS-like items; for each item, the 
examinee must choose the one statement that is 
most personally characteristic: 


1. A. I like to talk in front of a group.
   B. I like to work toward self-chosen goals.
2. A. I feel sad when I watch a tragic news story
      on TV.
   B. I feel nervous when I have to speak before a
      group.
3. A. I wouldn’t mind mopping up ten gallons of
      syrup.
   B. I wouldn’t mind scaling a steep cliff on a
      safety rope.


Why did Edwards adopt this awkward format 
for his test? The answer has to do with the prob- 
lem of social desirability response set. Social 
desirability response set is the tendency of ex- 
aminees to react to the perceived desirability (or 
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undesirability) of a test item rather than responding 
accurately to its content. Put simply, examinees 
tend to endorse socially desirable statements and 
tend not to endorse socially undesirable statements,
regardless of the truth value of the responses.² Most
persons would respond true to a statement such as
“I enjoy helping older persons across the street”
because the item describes a socially desirable
attribute; and most persons would respond false to a
statement such as “At times I have fantasized about
the death of my parents” because the item describes a
socially undesirable quality. But for some persons, the
socially desirable answer is not really accurate. After
all, in truth many persons really do not enjoy helping
others, and most individuals have fantasized about
unpleasant possibilities.


2. We should mention here that a social desirability response
set is a natural human tendency found in nearly every examinee.
The extreme form of this tendency is the conscious, deliberate
attempt to “fake good” on personality tests. But social
desirability need not betoken deliberate deception on the part of
the respondent. Social desirability is always present to some
degree, even when the examinee adopts a “good faith” perspective
in taking a personality test.

The elegance of the EPPS is that pairs of statements
in each item are matched for social desirability
(Edwards, 1957). Because each statement in an item
pair is of equal social desirability, the content of
each statement will exert more “pull” in determining
the examinee’s choice. Of course, Edwards carefully
designed the content of the statements to incorporate
Murray’s needs. By making all possible pairwise
comparisons between statements embodying the 15 needs,
the EPPS produces a measure of the relative strength
of each of Murray’s needs.

Because each need is paired twice with the 
other 14 needs, the maximum possible raw score on 
each scale is 2 x 14 or 28. This score would occur 
if the examinee chose the statement for a given need 
as being more personally characteristic than all the 
other needs it was paired with. Of course, the low- 
est possible score on a need scale would be zero. 
The EPPS also incorporates a consistency check by 
repeating 15 items in identical format. 

The reader should recognize that the EPPS is an 
ipsative test. On an ipsative measure, the overall 
score—averaged across all subtests—is always the 
same for every examinee. For example, because the 
EPPS is an ipsative measure, the overall average 
scale score is always 14, and high scale scores must 
be counterbalanced by low ones. Remember, too, 
that on an ipsative scale, high scores are relative, 
not absolute. In other words, the strength of each 
need is expressed not absolutely but relative to the 
strength of the examinee’s other needs. It is there- 
fore confusing that Edwards (1959) also recom- 
mends reporting EPPS scores in a normative 
format. The manual provides T scores and per- 
centiles by which an examinee’s raw scale scores 
can be compared to results from college and gen- 
eral adult samples (see Figure 14.1). By combining 
two incompatible test methodologies (ipsative and 
normative) in the same instrument, Edwards makes 
the interpretation of his test confusing. 
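
The arithmetic behind the ipsative property can be checked with a short simulation. The following is only a sketch under stated assumptions: it models the 210 scored pairs described above (every pair of the 15 needs presented twice) and lets a hypothetical examinee choose at random. The function name `simulate_epps_scores` and the random-choice rule are illustrative inventions, not part of the EPPS materials.

```python
import itertools
import random

N_NEEDS = 15   # number of need scales, as stated in the text
REPEATS = 2    # each pair of needs is presented twice


def simulate_epps_scores(seed=0):
    """Score one simulated examinee who chooses at random within every pair."""
    rng = random.Random(seed)
    scores = [0] * N_NEEDS
    # 105 distinct pairs of needs, each presented twice, gives 210 choices.
    for need_a, need_b in itertools.combinations(range(N_NEEDS), 2):
        for _ in range(REPEATS):
            chosen = rng.choice((need_a, need_b))  # forced choice: one need gets the point
            scores[chosen] += 1
    return scores


scores = simulate_epps_scores()
total = sum(scores)             # one point is awarded per pair, whatever is chosen
mean_score = total / N_NEEDS    # so the average scale score cannot vary
print(scores, total, mean_score)
```

Whatever seed is used, the total is 210 and the mean scale score is 14, because every choice adds exactly one point to exactly one scale. No single scale can exceed 28, since each need appears in only 2 x 14 pairs. A high score on one need can therefore arise only at the expense of the others.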

The EPPS is widely used in college counseling as a
means of personal discovery and sees occasional use as
a research tool. However, some reviewers regard the
EPPS as an exercise in test construction rather than a
serious entry into the market of validated tests
(Heilbrun, 1972; Drummond, 1987). Many clients become
frustrated and bored when taking the test. Furthermore,
the standardization is outdated and the reliability
findings are not particularly impressive. For example,
the test-retest reliability of the 15 scale scores
ranges from .55 to .87, with a median of .73. Cooper
(1990) concluded that the norms reported in Edwards
(1959) do not correspond to more recent normative studies.

Early attempts to validate the EPPS by com- 
paring ratings of the strength of Murray’s needs 
with scores on the EPPS met with mixed success 
(Drummond, 1987). However, a recent study by 
Piedmont, McCrae, and Costa (1992) provides 
strong support for the validity of the EPPS. These 
investigators correlated EPPS scores with NEO Per- 
sonality Inventory (NEO PI) scores for 330 under- 
graduate subjects. The NEO PI, discussed later, 
measures five constructs: Neuroticism, Extraver- 
sion, Openness to Experience, Agreeableness, and 
Conscientiousness. The pattern of relationships was 
supportive of the convergent validity of both
instruments, with scales showing appropriate and
theory-confirming correlations. For example, EPPS
Aggression correlated .47 with NEO PI Neuroticism
and -.53 with NEO PI Agreeableness. The
relationships were strongest and most theory- 
confirming when the EPPS was scored in the nor- 
mative fashion. The ipsative, forced-choice format 
of the EPPS apparently lowered validity coefficients 
and decreased convergent and discriminant validity. 


Personality Research Form 


Another test based on Murray’s need system is the 
Personality Research Form (Jackson, 1970, 
1984b). This test is available in several forms that 
differ in the number of scales or number of items 
per scale. In addition to parallel short tests (forms 
A and B), the PRF also exists as parallel long forms 
(forms AA and BB). These forms, used primarily 
with college students, consist of 440 true-false 
items. The long forms yield 20 personality scale 
scores and two validity scores, Infrequency and De- 
sirability (Table 14.1). The most popular version of 
the PRF is form E, which consists of all 22 scales 
in a modified 352-item test. 

In constructing the PRF long forms, Jackson first
developed precise and detailed descriptions of the
constructs to be measured. Next, for each scale over
100 items were written to tap the traits underlying the
hypothesized needs. After editorial review, these items
were administered to large samples of college students.
Based upon high biserial correlations with total scale
scores and low correlation with scores on the other
scales and the Desirability scale, 20 items were
selected for each scale.


TABLE 14.1 Personality Research Form Scales

Scale                  Interpretation of High Score
Abasement              Self-effacing, humble, blame-accepting
Achievement            Goal striving, competitive
Affiliation            Friendly, accepting, sociable
Aggression             Argues, combative, easily annoyed
Autonomy               Independent, avoids restrictions
Change                 Avoids routine, seeks change
Cognitive Structure    Prefers certainty, dislikes ambiguity
Defendence             On guard, takes offense easily
Dominance              Influential, enjoys leading
Endurance              Persevering, hard-working
Exhibition             Dramatic, enjoys attention
Harm Avoidance         Avoids risk and excitement
Impulsivity            Impulsive, speaks freely
Nurturance             Caring, sympathetic, comforting
Order                  Organized, dislikes confusion
Play                   Playful, light-hearted, enjoys jokes
Sentience              Notices, remembers sensations
Social Recognition     Concern for reputation and approval
Succorance             Insecure, seeks reassurance
Understanding          Values logical thought
Desirability           Validity Scale: favorable presentation
Infrequency            Validity Scale: infrequent responses

Source: Adapted from Personality Research Form Scales and Descriptions from Jackson, D. N. (1989).
Personality research form manual (3rd ed.). Port Huron, MI: Sigma Assessment Systems, Inc., Research

Unlike many other personality inventories, the 
PRF scales have no item overlap. As a result, the 
scales are exceptionally independent, with most
intercorrelation coefficients in the vicinity of ±.30
(Gynther & Gynther, 1976). Furthermore, the rig- 
orous scale construction procedures employed by 
Jackson (1970) yielded scales with exceptionally 
good internal consistency, with a range of .80 to .94 
and median of .92. A desirable feature of the PRF 
is its readability: The test requires only a fifth- or
sixth-grade reading level (Reddon & Jackson,
1989). MacLennan (1992) has prepared an anno- 
tated bibliography of over 375 citations to the PRF. 

The construct validity of the PRF rests espe- 
cially upon confirmatory factor analyses corrobo- 
rating the grouping of the items into 20 scales 
(Jackson, 1970, 1984b). In addition, research indi- 
cates positive correlations with comparable scales 
on other inventories (Mungas, Trontel, & Wein- 
gardner, 1981). For example, Edwards and Abbott 
(1973) found exceptionally strong and confirma- 
tory correlations between similar scales on the PRF 
and the Edwards Personality Inventory (EPI; Ed- 
wards, 1967). The EPI is a respected but little-used 
test consisting of 1,200 (!) true-false questions. 
Some of the confirmatory correlations between
PRF and EPI scales for 218 male and female college
students are reported as follows:


Achievement (PRF) × Is a Hard Worker (EPI)      .74
Change (PRF) × Likes a Set Routine (EPI)       -.54
Nurturance (PRF) × Helps Others (EPI)           .64
Succorance (PRF) × Dependent (EPI)              [illegible]


Because these instruments were developed indepen- 
dently according to different test construction 
philosophies, the findings bolster the validity of both 
tests. Several recent empirical comparisons also sup- 
port the validity and utility of the PRF. For example, 
Goffin, Rothstein, and Johnston (2000) found that
the PRF outperformed the more widely used Sixteen 
Personality Factor Questionnaire (16PF, discussed 
later in this section) in predicting the job perfor- 
mance of 487 candidates for managerial positions. 
Vernon (2000) also reports favorably on the validity 
of the PRF in his review of recent studies. 


Myers-Briggs Type Indicator (MBTI) 


Originally published in 1962, the MBTI is a forced- 
choice, self-report inventory that attempts to clas- 
sify persons according to an adaptation of Carl 
Jung’s theory of personality types (Myers & Mc- 
Caulley, 1985; Tzeng, Ware, & Chen, 1989). The 
instrument comes in a 166-item version (Form F) 
and a 126-item version (Form G). We mainly dis- 
cuss Form F here, because it is the most widely 
used. In fact, the MBTI may be the most widely 
used personality test of any kind with nonpsychi- 
atric populations (DeVito, 1985). 

The MBTI is scored on four theoretically inde- 
pendent dimensions: Extraversion-Introversion, 
Sensing-iNtuition, Thinking-Feeling, Judging-Per- 
ceptive. Although scores on each bipolar dimension 
are continuous, it is common practice to summa- 
rize an examinee’s scores in a typological manner. 
For example, an examinee might score more toward
Extraversion, iNtuition, Feeling, and Perceptive, and
thereby obtain a summary type of ENFP.³ Such a profile
would suggest the following personality characteristics:
a greater relatedness to the outer world of people and
things than to the inner world of ideas (E); a tendency
to look for possibilities rather than to work with known
facts (N); a bias for basing judgments on personal values
rather than analysis and logic (F); and preference for a
flexible, spontaneous way of life rather than a planned,
orderly existence (P).


3. Because there are two poles to each of the four dimensions,
the number of possible personality types is 2⁴, which is 16.
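
The typological summary can be sketched in a few lines. This is only an illustration under stated assumptions: the sign convention (a positive score leans toward the first-named pole) and the handling of a zero score are hypothetical and are not the MBTI's published scoring rules. Only the four dimensions and their letters come from the text.

```python
from itertools import product

# The four bipolar dimensions named in the text, as (first pole, second pole).
DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def type_code(preferences):
    """Collapse four continuous scores into a four-letter type.

    Assumed convention (hypothetical): a positive score leans toward the
    first pole of a dimension; zero or a negative score goes to the second.
    """
    letters = []
    for (first, second), score in zip(DIMENSIONS, preferences):
        letters.append(first if score > 0 else second)
    return "".join(letters)


# Every combination of one pole per dimension: 2 x 2 x 2 x 2 types.
all_types = ["".join(combo) for combo in product(*DIMENSIONS)]

example = type_code([5, -3, -1, -7])   # leans E, N, F, P
print(len(all_types), example)
```

The sketch also makes the cost of the typological summary visible: an examinee scoring +1 and another scoring +25 on the same dimension receive the same letter, so the continuous information in the scores is discarded.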

Standardization data for the MBTI consist of percentile
norms for the four indicator scores, derived from small
samples of high school and college students. Split-half
reliabilities for the four scales are listed in the .70s
and .80s. Perhaps in part because supportive validity
studies are scant, the typological interpretation of the
MBTI is generally not well received by measurement-oriented
psychologists. One problem with this approach is that the
interpretations seem too slick and simple, possessing an
almost horoscope-like quality. In fairness to the MBTI,
there are more sophisticated ways to interpret the
instrument, as revealed by an explosion of recent research.
More than 400 references citing the MBTI were found in
PsycINFO from 1992 through 2002. For example, in a study of
177 managers, Higgs (2001) reported a significant
relationship between emotional intelligence and the dominant
MBTI function of Intuition. Emotional intelligence involves
monitoring the emotions of self and others and using this
information to guide thinking and actions (Mayer & Salovey,
1993). A positive relationship with MBTI Intuition supports
the construct validity of this dimension. In a review of 17
studies reporting reliability coefficients, Capraro and
Capraro (2002) found respectably strong reliability
estimates for the four MBTI dimensions, with average
coefficients of .84 (EI), .84 (SN), .67 (TF), and .82 (JP).
The MBTI appears to be edging its way into the mainstream of
psychological testing as researchers investigate this
instrument with traditional empirical procedures. Kaufman,
McLean, and Lincoln (1996) list relevant references.


Measures of Type A Behavior 


By way of quick review, Type A behavior refers to
a hard-driving, aggressive behavior pattern that
might be called “hurry sickness” (Friedman &
Rosenman, 1974). Several questionnaire measures 
of Type A behavior are available for research pur- 
poses. The most recent is the Time Urgency and 
Perpetual Activation (TUPA) scale (Wright, Mc- 
Curdy, & Rogoll, 1992). The TUPA scale is re- 
spected by researchers in behavioral medicine 
because of its psychometric excellence. As evi- 
dence of the utility of the instrument, scores on the 
TUPA modestly predict a number of physical health
problems in college students, including respiratory 
illnesses, pain, and sensory disturbances (Wright, 
Nielsen, Abranato, Jackson, & Lancaster, 1995). 
The best-known and most widely used instru- 
ment of this type is the Jenkins Activity Survey 
(JAS). The JAS is a 52-item, multiple-choice, self- 
report questionnaire designed to identify the Type 
A coronary-prone behavior pattern discussed in the 
previous chapter (Jenkins, Zyzanski, & Rosenman, 
1979). Items on the JAS resemble the following: 
Currently, do you consider yourself to be: 


A. Definitely competitive and ambitious? 
B. Probably competitive and ambitious? 
C. Probably more relaxed and easygoing? 
D. Definitely more relaxed and easygoing? 


In addition to the composite Type A behavior score, 
the JAS yields three factor-analytically derived sub- 
scales: Speed and Impatience, Job Involvement, and 
Hard-Driving/Competitiveness. Correlations be- 
tween the composite Type A scale and the three sub- 
scales are modest (.42 to .67), indicating that the 
factor scores may provide independent contribu- 
tions to the assessment of Type A tendencies. The 
JAS is normed on 2,588 employed middle-class 
males ages 48 to 65 years. The instrument was stan- 
dardized to have a mean of 0.0 and standard devia- 
tion of 10.0, with positive scores indicating Type A 
tendencies, negative scores indicating the opposite, 
Type B tendencies. 
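
The standard-score metric just described (mean of 0.0, standard deviation of 10.0) amounts to a linear rescaling of the raw score against the norm group. The sketch below shows only that rescaling. The raw scores and the norm mean and standard deviation are invented for illustration, and the function does not reproduce the weighted item scoring that produces a JAS raw score in the first place.

```python
def standard_score(raw, norm_mean, norm_sd, target_mean=0.0, target_sd=10.0):
    """Rescale a raw score to a metric with the given mean and SD.

    norm_mean and norm_sd describe the normative sample; the values used
    in the examples below are hypothetical.
    """
    z = (raw - norm_mean) / norm_sd      # distance from the norm mean in SD units
    return target_mean + target_sd * z


# Hypothetical norm group with a raw-score mean of 100 and SD of 20.
above = standard_score(120, norm_mean=100, norm_sd=20)   # one SD above the mean
at_mean = standard_score(100, norm_mean=100, norm_sd=20)
below = standard_score(80, norm_mean=100, norm_sd=20)    # one SD below the mean
print(above, at_mean, below)
```

On this metric the sign of the score carries the interpretation: an examinee one standard deviation above the norm mean scores +10 (Type A direction), and one standard deviation below scores -10 (Type B direction).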

The Type A behavior pattern also consists of insecurity
of status, hyperaggressiveness, free-floating hostility,
and a sense of time urgency (Friedman & Ulmer, 1984).
Some studies indicate that persons with this behavior
pattern are at increased risk of coronary heart disease
(CHD). Early identification of high-risk individuals
therefore might have important implications for
intervention. Prior to the JAS, a lengthy structured
interview provided the only means for identifying persons
with the Type A behavior pattern. The JAS was developed
in an attempt to duplicate the structured interview,
thereby providing a quick and economical method of
screening for Type A behavior.

Unfortunately, the JAS has not fulfilled its ambitious
aspirations. The test-retest reliability of the three
subscales is marginal at best, with values as low as .58
for Speed and Impatience, .66 for Job Involvement, and
.71 for Hard-Driving/Competitiveness (Bishop, Hailey, &
O’Rourke, 1989; Igbokwe, 1989). Furthermore, the level of
agreement between the structured interview and JAS scores
is only fair, not strong enough to warrant the use of
this test for individual diagnosis (Yarnold & Bryant,
1988). Another problem with the JAS is that patients with
CHD do not consistently differ from general medical
patients on its subscales in the expected direction. In
comparing 40 patients with CHD and 40 patients with other
medical problems, Wright (1992) found that the Speed and
Impatience scale produced a significant and appropriate
difference, but Hard-Driving/Competitiveness yielded a
significant difference in the wrong direction: the CHD
patients scored lower than the non-CHD medical patients.

In addition, the norms are obviously not repre- 
sentative, because they do not include women, 
young or elderly persons, or persons from lower social
strata. The JAS also is difficult to score by hand 
because of the complex weighting system used. 
Blumenthal (1985) offers a not too flattering re- 
view of the JAS and recommends its use only for 
clinical and experimental research. Nonetheless, 
researchers continue to find value in the JAS, al- 
though it is now recognized that specific subscores 
are more predictive of health problems than the 
overall global score (Hart, 1997). For example, 
Palmero, Diez, and Asensio (2001) evaluated phys- 
iological measures of cardiac reactivity in 89 col- 
lege students undergoing an actual examination. 
Those who scored high on the Speed and Impa- 
tience subscale of the JAS showed higher cardiac 
reactivity and took longer to recover their initial 
heart rate values than those who scored low on the
subscale. Thus, this specific subscale does con- 
tribute meaningfully to the assessment of one com- 
ponent of the Type A behavior pattern. 


State-Trait Anxiety Inventory (STAI) 


The STAI is a short, quick measure of state and trait 
anxiety that has received high marks for technical 
merit (Spielberger and others, 1983). State anxiety
refers to the transitory feelings of fear or worry that most
of us experience on occasion. Trait anxiety is the 
relatively stable tendency of an individual to respond 
anxiously to a stressful predicament. These two con- 
structs are separate but related: The level of trait anx- 
iety reflects the proneness to display state anxiety. 

The STAI is a 40-item measure that assesses 
both types of anxiety separately. Items on the STAI 
are simple, descriptive terms such as “high-strung,” 
“secure,” and “relaxed.” The 20 State-anxiety items 
are each rated on a four-point intensity scale, la- 
beled “Not At All,” “Somewhat,” “Moderately So,” 
and “Very Much So.” Examinees are instructed to 
rate these items for how they feel “right now.” The 
20 Trait-anxiety items are each rated on a four- 
point intensity scale that is labeled “Almost Never,” 
“Sometimes,” “Often,” and “Almost Always.” Ex- 
aminees are instructed to rate these items for how 
they “generally feel.” 
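
The rating format lends itself to a simple summed score. The sketch below makes several assumptions that are not stated in the text above: that the four labels are scored 1 through 4, that items describing the absence of anxiety (such as “secure” and “relaxed”) are reverse-keyed before summing, and that a scale score is the plain sum of its item values. The three-item response set is invented for illustration and is far shorter than a 20-item STAI scale.

```python
STATE_LABELS = ["Not At All", "Somewhat", "Moderately So", "Very Much So"]


def rating_value(label, labels=STATE_LABELS):
    """Map a response label to 1..4 by its position (assumed numeric coding)."""
    return labels.index(label) + 1


def scale_score(responses, reversed_items, labels=STATE_LABELS):
    """Sum item values, flipping items that describe the absence of anxiety."""
    total = 0
    for item, label in responses.items():
        value = rating_value(label, labels)
        if item in reversed_items:
            value = len(labels) + 1 - value   # 1<->4, 2<->3
        total += value
    return total


# Hypothetical responses to three of the descriptive terms quoted in the text.
responses = {
    "high-strung": "Very Much So",
    "secure": "Not At All",
    "relaxed": "Somewhat",
}
anxiety_absent = {"secure", "relaxed"}   # assumed reverse-keyed items
state_score = scale_score(responses, anxiety_absent)
print(state_score)
```

Under these assumptions a full 20-item scale would range from 20 to 80. The same function could score the Trait scale by passing its own label list (“Almost Never” through “Almost Always”) as `labels`.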

The STAI is used with high school students, 
college students, and adults. A similar children’s 
version, the State-Trait Anxiety Inventory for Chil- 
dren, is targeted for elementary and junior high 
school students. The technical aspects of this test 
are generally good, and the standardization samples 
are large and representative. For example, median 
test-retest reliability is .88 for trait anxiety and also 
respectable but predictably lower for state anxiety 
(median of .70 in seven studies) (Barnes, Harp, & 
Jung, 2002). Two weak points of the STAI should 
be mentioned. First, the construct of trait anxiety is 
not well defined and seems to include unrelated 
traits such as a general feeling of dissatisfaction 
with oneself (Chaplin, 1984). In addition, the STAI 
is a totally face valid instrument that can be faked 
with impunity. For this reason, results of the STAI
must be interpreted with considerable caution, particularly
when situational demands for “healthy”
test results are strong. Spielberger (1984) has pub- 
lished a comprehensive bibliography of STAI re- 
search studies. A recent review of research and 
applications with the STAI is provided by Spiel- 
berger, Sydeman, Owen, and Marsh (1999). 


FACTOR-ANALYTICALLY 
DERIVED INVENTORIES


Sixteen Personality Factor 
Questionnaire (16PF) 


The 16PF is a widely used forced-choice test of per- 
sonality that is currently available in five separate 
forms. Each form consists of declarative stems that 
require the examinee to respond to a specific situ- 
ation by choosing from among two (Form E) or 
three forced-choice options (Forms A, B, C, and D). 
Examples of 16PF-like items include the following: 


I make decisions based on 
a. feelings 
b. feelings and reason equally 
c. reason 
Which of the following items is different from 
the others?⁴
a. candle 
b. star 
c. lightbulb 
I find it hard to give a speech to strangers. 
a. yes 
b. somewhat 
c. no 


The forms contain from 105 to 187 items and dif- 
fer mainly in reading level (from third-grade to sev- 
enth-grade level). The test is untimed and is usually 
completed in 30 to 60 minutes. 


4. The inclusion of what appear to be intelligence test items in 
the 16PF may seem curious to the reader. In fact, psychologists 
have long recognized that personality and intelligence are com- 
plexly intertwined. Most test builders have addressed this 
quandary by attempting to tease personality and intelligence apart. 
Cattell decided instead to exploit the overlaps between personal- 
ity and intelligence by including elements of both in the same test. 


The 16PF is intended for high school seniors 
and adults. Most norms date back to 1970, a signif- 
icant shortcoming of this test. However, Form E has 
been recently normed for highly diverse popula- 
tions, including prison inmates, patients with schiz- 
ophrenia, culturally disadvantaged individuals, and 
physical rehabilitation clients. Nonetheless, most 
practitioners would concede that the 16PF is more 
suited to a “normal” rather than an “emotionally 
disturbed” population. The 16PF also is useful for 
cross-cultural applications (e.g., Argentero, 1989). 

The 16PF is predicated on Cattell’s factor-ana- 
lytic conception of personality (Cattell, Eber, & 
Tatsuoka, 1970). According to this model, surface 
traits—the more obvious aspects of personality— 
emerge from simple cluster analyses of test re- 
sponses. In contrast, source traits—the stable, 
constant, but less-visible wellsprings of behavior—
emerge only from specialized factor analyses of the 
surface traits (Cattell, 1950). In a series of studies, 
Cattell determined that 16 personality factors or 
source traits are needed to explain the structure of 
test responses, hence the name for his instrument. 

The 16PF yields a total of 20 indices or attributes 
of personality. In addition to the 16 basic scales, four 
second-order indices of personality are computed 
from weighted linear sums of the previous 16 in- 
dices, yielding a total of 20 bipolar scales. Over the 
years, the meaning of extreme scores in either direc- 
tion has been well established (Table 14.2). 
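
A weighted linear sum of this kind is easy to illustrate. The sketch below is not the published 16PF scoring equation: the choice of contributing scales, the weights, and the primary scale scores are all invented to show the form of the computation. Only the scale names are taken from Table 14.2.

```python
def second_order_index(primary_scores, weights, constant=0.0):
    """Weighted linear sum of primary scale scores.

    weights maps a primary scale name to its (hypothetical) weight; scales
    that are absent from weights contribute nothing to the index.
    """
    return constant + sum(
        weight * primary_scores[name] for name, weight in weights.items()
    )


# Invented primary scale scores for one examinee.
primary_scores = {"Warmth": 8, "Boldness": 6, "Self-sufficiency": 4, "Tension": 5}

# Invented weights for an extraversion-like index: sociable scales count
# positively, self-sufficiency counts negatively.
extraversion_weights = {"Warmth": 0.5, "Boldness": 0.25, "Self-sufficiency": -0.25}

extraversion = second_order_index(primary_scores, extraversion_weights)
print(extraversion)
```

Because each second-order index is built entirely from the 16 primary scales, it adds no new item content. It summarizes information the primary scores already contain.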

The major thrust of application for the 16PF is career
guidance, vocational exploration, and occupational
testing. One reason for the popularity of the instrument
(it is second in use only to the MMPI/MMPI-2) is that
answer sheets can be mailed in for quick-turnaround
machine scoring. Most practitioners also request a
computer-generated narrative report. An attractive
feature of these reports is the wealth of information
provided. Reports include a capsule personality
description, score profile, and summary of clinical
signs, cognitive factors, and need patterns.


TABLE 14.2 The 16 Personality Factors and 4 Second-Order Indices from the 16PF

Factor Name            Interpretation of Low Score             Interpretation of High Score
Warmth                 reserved, detached, cool, impersonal    warm, outgoing, likes people
Intelligence           concrete thinking                       abstract thinking, bright
Emotional Stability    emotionally less stable, changeable     emotionally stable, calm, mature
Dominance              submissive, conforming, mild            dominant, assertive, competitive
Impulsivity            serious, prudent, sober, taciturn       enthusiastic, cheerful, heedless
Conformity             expedient, disregards rules             conforming, persevering, moralistic
Boldness               shy, timid, restrained                  bold, uninhibited, spontaneous
Sensitivity            tough-minded, self-reliant              tender-minded, sensitive
Suspiciousness         trusting, adaptable                     suspicious, hard to fool, opinionated
Imagination            practical, conventional                 impractical, absent-minded, unconventional
Shrewdness             forthright, genuine, unpretentious      calculating, polished, socially alert
Insecurity             confident, self-satisfied, secure       self-blaming, worrying, troubled
Radicalism             conservative, resisting change          liberal, analytical, innovative
Self-sufficiency       group-oriented, sociable                resourceful, self-sufficient
Self-discipline        undisciplined, impulsive                compulsive, socially precise
Tension                relaxed, tranquil, low drive            frustrated, driven, tense
Extraversion (QI)      introversion                            extraversion
Anxiety (QII)          low anxiety                             high anxiety
Tough Poise (QIII)     sensitivity, emotionalism               tough poise
Independence (QIV)     dependence                              independence

Source: Based on Wholeben, B. E. (1987). Sixteen Personality Factor Questionnaire. In D. J. Keyser & R. C. Sweetland (Eds.), Test
critiques compendium. Kansas City, MO: Test Corporation of America. Also, Cattell, R. B. (1986). The handbook for the 16 Personality
Factor Questionnaire. Champaign, IL: Institute for Personality and Ability Testing.

A major shortcoming of the 16PF is that the 16 
surveyed personality attributes are based upon as 
few as 10 to 13 items each. Inevitably, a test with 
scales as short as these will possess diminished 
reliability. Not surprisingly, the split-half reliabili- 
ties of the 16 factors are as low as .54; correlations 
between the same scales for different forms of the 
test typically hover around .50; and test-retest co- 
efficients for scales on the same form are .70 to .80 
for same or next-day administrations, but much 
lower for longer intervals. 
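
The link between scale length and reliability can be made concrete with the Spearman-Brown formula, which predicts the reliability of a scale lengthened by a factor k with comparable items: r_new = k r / (1 + (k - 1) r). The sketch below applies it to the lowest figure quoted above. Treating .54 as the reliability of a 10-item scale is an illustrative assumption, as are the function names.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def lengthening_factor(reliability, target):
    """Factor by which a scale must be lengthened to reach a target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Assumed starting point: a 10-item scale with reliability .54.
doubled = spearman_brown(0.54, 2)          # the same scale with 20 comparable items
needed = lengthening_factor(0.54, 0.80)    # growth required to reach .80
print(round(doubled, 2), round(needed, 2))
```

Under these assumptions, doubling the scale to 20 items would raise the predicted reliability to about .70, and reaching .80 would require roughly 3.4 times as many items, or about 34 in all. The formula presumes that the added items are of the same quality as the originals, which item writers cannot always guarantee.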

Most of the validity evidence for the 16PF con- 
sists of statistical demonstrations that items “be- 
long” on their respective scales and that the scales 
consist of relatively pure factors. The evidence in 
this regard is reasonably encouraging (Cattell, Eber, 
& Tatsuoka, 1970). In addition, some studies with 
the 16PF demonstrate that the real-world correlates 
of test results are theory-consistent. For example, 
Cattell and Nesselroade (1967) studied the similar- 
ity of 16PF profiles in 102 stably and 37 unstably 
married couples. These authors discovered that stably
married couples are considerably more similar on the 16PF
than unstably married couples. In Table 14.3, the reader
will notice that scale correlations for stably married
couples are almost uniformly positive, meaning that these
couples produce similar 16PF scale scores, whereas more
than half the scale correlations for unstably married
couples are negative, meaning that these couples often
produce dissimilar 16PF scale scores. These findings
support the viewpoint that likeness facilitates stable
marriages.⁵ More importantly, the results bolster the
validity of the 16PF by showing that test results carry
meaningful and predictable real-world implications.
H. Cattell provides a recent update on the development
and interpretation of the 16PF.


5. Another possibility is that in stable marriages the personalities
of the partners evolve toward increasing similarity over the
years. Of course, both factors could operate simultaneously.


Eysenck Personality Questionnaire 


The Eysenck Personality Questionnaire (EPQ) is
the latest in a series of tests designed to measure
the major dimensions of normal and abnormal
personality (Eysenck & Eysenck, 1975). Based on a
lifelong program of factor-analytic questionnaire
research and laboratory experiments on learning 
and conditioning, Eysenck isolated three major 
dimensions of personality: Psychoticism (P), Extra- 
version (E), and Neuroticism (N). The EPQ consists 
of scales to measure these dimensions and also in- 
corporates a Lie (L) scale to assess the validity of an 
examinee’s responses. The EPQ contains 90 state- 
ments answered “yes” or “no” and is designed for 
persons aged 16 and older. A Junior EPQ containing 
81 statements is suitable for children ages 7 to 15. 
Items on the P scale resemble the following: 


Do you often break the rules? (T) 
Would you worry if you were in debt? (F) 
Do you take risks just for fun? (T) 


High scores on the P scale indicate aggressive and
hostile traits, impulsivity, a liking for odd or unusual
things, and empathy defects. Antisocial and schizoid
patients often obtain high scores on this dimension. In
contrast, low scores on P indicate more desirable
characteristics such as empathy and interpersonal
sensitivity. Items on the E scale resemble the following:


Do you like to meet new people? (T) 
Are you quiet when with others? (F) 
Do you like lots of excitement? (T) 


High scores on the E scale indicate a loud, gregar- 
ious, outgoing, fun-loving person. Low scores on 
the E scale indicate introverted traits such as a pref- 
erence for solitude and quiet activities. Items on the 
N scale resemble the following: 


Are you a moody person? (T) 
Do you feel that life is dull? (T) 
Are your feelings easily hurt? (T) 
TABLE 14.3 Intercorrelations of 16PF Factors for Stably and Unstably
Married Spouses

                        Correlation for Stably   Correlation for Unstably   p Value for
16PF Factor             Married (N = 102)        Married (N = 37)           Difference
A   Warmth               .16                     -.50                       < .001
B   Intelligence         [illegible]              .21                       ns
C   Ego Strength         .32                      .05                       ns
E   Dominance            .13                      .31                       ns
F   Impulsivity          .23                     -.40                       < .001
G   Conformity           .33                      .19                       ns
H   Boldness             .23                      [illegible]               ns
I   Sensitivity         -.15                     -.13                       ns
L   Suspiciousness       .18                     -.33                       < .01
M   Imagination          [illegible]             -.01                       ns
N   Shrewdness           .18                      .27                       ns
O   Insecurity           .11                      .36                       ns
Q1  Radicalism           .27                      .34                       ns
Q2  Self-Sufficiency     .15                     -.32                       < .01
Q3  Self-Discipline      .27                     -.02                       ns
Q4  Tension              .16                     -.11                       ns
QI  Extraversion         .22                     -.30                       < .01
QII Anxiety              .31                      .23                       ns

Source: Adapted with permission from Cattell, R. B., & Nesselroade, J. R. (1967). Likeness and
completeness theories examined by Sixteen Personality factor measures on stably and unstably married
couples. Journal of Personality and Social Psychology, 7, 351-361.


The N scale reflects a dimension of emotionality 
that ranges from nervous, maladjusted, and overe- 
motional (high scores) to stable and confident (low 
scores). 
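
The (T) and (F) markers beside the sample items indicate the keyed direction of each item: the answer that earns a point on the scale. A minimal sketch of keyed scoring follows, using the three P-like items quoted earlier. It assumes that “yes” corresponds to T and “no” to F, and that a scale score is simply the count of answers in the keyed direction. The response set is invented.

```python
# Keyed direction for the three P-like items quoted in the text:
# True means a "yes" earns the point, False means a "no" earns it.
P_KEY = {
    "Do you often break the rules?": True,
    "Would you worry if you were in debt?": False,
    "Do you take risks just for fun?": True,
}


def keyed_score(answers, key):
    """Count the answers that match the keyed direction.

    answers maps item text to "yes" or "no"; unanswered items earn nothing.
    """
    score = 0
    for item, keyed_yes in key.items():
        answer = answers.get(item)
        if answer is None:
            continue
        if (answer == "yes") == keyed_yes:
            score += 1
    return score


# Hypothetical examinee: breaks rules, worries about debt, avoids risks.
answers = {
    "Do you often break the rules?": "yes",
    "Would you worry if you were in debt?": "yes",
    "Do you take risks just for fun?": "no",
}
p_score = keyed_score(answers, P_KEY)
print(p_score)
```

Keying some items in the “no” direction, as with the debt item, means that an examinee who simply answers “yes” to everything does not automatically obtain the maximum score.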

The reliability of the EPQ is excellent. For ex- 
ample, the one-month test-retest correlations were 
.78 (P), .89 (E), .86 (N), and .84 (L). Internal con- 
sistencies were in the .70s for P and the .80s for the 
other three scales. The construct validity of the 
EPQ is also well established through dozens of 
studies using behavioral, emotional, learning, at- 
tentional, and therapeutic criteria (reviewed in 
Eysenck & Eysenck, 1976, 1985). Friedman (1987) 
provides a short but thorough introduction to other 
sources on the EPQ. 

A major focus of research with the EPQ has 
been on the empirical correlates of extraversion 
and its polar opposite, introversion. Eysenck and
Eysenck (1975) describe the typical extravert as
follows:


The typical extravert is sociable, likes parties, has 
many friends, needs to have people to talk to, and 
does not like reading or studying by himself. He 
craves excitement, takes chances, often sticks his 
neck out, acts on the spur of the moment, and is 
generally an impulsive individual. 


They describe the typical introvert as follows: 


The typical introvert is a quiet, retiring sort of per- 
son, introspective, fond of books rather than peo- 
ple; he is reserved and distant except to intimate 
friends. He tends to plan ahead, “looks before he 
leaps,” and mistrusts the impulse of the moment. 


Eysenck and his followers have linked a num- 
ber of perceptual and physiological factors to the 
extraversion/introversion dimension. Because of 
space limitations, we can only list representative
findings here: 


• Introverts are more vigilant in watchkeeping.
• Introverts do better at signal-detection tasks.
• Introverts are less tolerant of pain but more
  tolerant of sensory deprivation.
• Extraverts are more easily conditioned to stimuli
  associated with sexual arousal.
• Extraverts have a greater need for external
  stimulation.


Aiken (1989) summarizes additional research on 
the real-world correlates of the EPQ extraver- 
sion/introversion dimension. 

In general, the technical characteristics of the 
EPQ are very strong, certainly stronger than found 
in most self-report inventories. The practical utility 
of the instrument is supported by voluminous re- 
search literature. Nonetheless, the EPQ has never 
caught on among American psychologists, who 
seem enamored of multiphasic instruments that 
produce 10, 20, or 30 scores, not a simple trio of 
basic dimensions. 


Comrey Personality Scales 


For practitioners who desire a short self-report in- 
ventory suitable for college students and other 
adults, the Comrey Personality Scales (Comrey, 
1970, 1980) would be a good choice. As a protégé 
of Guilford, Comrey pursued a factor-analytic strat- 
egy in developing his 180-item test. Comrey relied 
exclusively upon college students in the devel- 
opment and standardization of his test, so the CPS 
is well suited to assessment of personality in this 
subpopulation. 

A special virtue of the CPS is its brevity. Con- 
sisting of 180 statements, the test is only one-third 
as long as competing instruments such as the
MMPI-2. The eight CPS personality scales consist 
of 20 items each, divided equally between posi- 
tively and negatively worded statements. Another 
20 items are devoted to a validity check and the as- 
sessment of social desirability response bias. 

The following description of CPS scales is 
based upon Merenda (1985): 


(V) Validity Check. A score of 8 is the expected 
raw score. Any score on the V scale which 
gives a T-score equivalent below 70 is still 
within the normal range, however. Higher 
scores are suggestive of an invalid record. 

(R) Response Bias. High scores indicate a ten- 
dency to answer questions in a socially de- 
sirable way, making the respondent look like 
a “nice” person. 

(T) Trust versus Defensiveness. High scores in- 
dicate a belief in the basic honesty, trustwor- 
thiness, and good intentions of other people. 

(O) Orderliness versus Lack of Compulsion. 
High scores are characteristic of careful, 
meticulous, orderly, and highly organized in- 
dividuals. 

(C) Social Conformity versus Rebelliousness. 
Individuals with high scores accept society 
as it is, resent nonconformity in others, seek 
the approval of society, and respect the law. 

(A) Activity versus Lack of Energy. High-scor- 
ing individuals have a great deal of energy 
and endurance, work hard, and strive to excel. 

(S) Emotional Stability versus Neuroticism.
High-scoring persons are free from depres- 
sion, optimistic, relaxed, stable in mood, and 
confident. 

(E) Extraversion versus Introversion. High- 
scoring individuals meet people easily, seek 
new friends, feel comfortable with strangers, 
and do not suffer from stage fright. 

(M) Masculinity versus Femininity. High-scor- 
ing individuals tend to be rather tough-minded 
people who are not bothered by blood, crawl- 
ing creatures, vulgarity, and who do not cry 
easily or show much interest in love stories. 

(P) Empathy versus Egocentrism. High-scoring 
individuals describe themselves as helpful, 
generous, sympathetic people who are inter- 
ested in devoting their lives to the service of 
others. 


Reflecting its careful factor-analytic derivation, 
the CPS scales possess exceptional internal consis- 
tencies, which range from .91 to .96. These findings 
indicate that the CPS is most likely a reliable test,
but traditional test-retest data are scant. Cross-
cultural studies with the CPS are highly supportive
of its validity. Brief and Comrey (1993) report that 
the eight-factor solution to CPS item responses is 
found in factor analyses with Russian, U.S., Brazil- 
ian, Israeli, Italian, and New Zealand samples. Other 
validational studies with the CPS are not straight- 
forward in their interpretation. On the one hand, the 
correlations between CPS scale scores and person- 
ality-relevant biographical data are very small (Com- 
rey & Backer, 1970; Comrey & Schiebel, 1983). On 
the other hand, extreme scores on the CPS scales are 
strongly associated with psychological disturbance 
(Comrey & Schiebel, 1985). This is particularly true
for low scores on Trust versus Defensiveness,
Activity versus Lack of Energy, Emotional Stability
versus Neuroticism, and Extraversion versus
Introversion, and for high scores on Orderliness versus
Lack of Compulsion. Shen and Comrey (1997) describe the
utility of the CPS with medical students, showing 
that the test is a reasonable predictor of clinical per- 
formance and personal suitability. In general, re- 
viewers conclude that the CPS is a promising test 
that needs updated standardization and additional 
documentation on its technical qualities. Comrey 
(1995) summarizes validity studies of his test. 


NEO Personality Inventory-Revised 


The NEO Personality Inventory-Revised (NEO PI-R) 
embodies decades of factor-analytic research with 
clinical and normal adult populations (Costa & 
McCrae, 1992). The test is based upon the five- 
factor model of personality described in the previ- 
ous chapter. It is available in two parallel forms 
consisting of 240 items rated on a five-point dimen- 
sion. An additional three items are used to check 
validity. A shorter version, the NEO Five-Factor 
Inventory (NEO-FFI) is also available (Costa & 
McCrae, 1989). We limit our discussion to the NEO 
PI-R. Form S is for self-reports whereas Form R is 
for outside observers (e.g., the spouse of a client). 
The item format consists of five-point ratings: 
strongly disagree, disagree, neutral, agree, strongly 
agree. The items assess emotional, interpersonal, 
experiential, attitudinal, and motivational variables. 
The five domain scales of the NEO PI-R are each
based upon six facet (trait) scales (Table 14.4). The 
internal consistency of the scales is superb: .86 to .95 
for the domain scales, and .56 to .90 for the facet 
scales. Stability coefficients range from .51 to .83 in 
three- to seven-year longitudinal studies. Validity ev- 
idence for the NEO PI-R is substantial, based upon 
the correspondence of ratings between self and 
spouse, correlations with other tests and checklists, 
and the construct validity of the five-factor model it- 
self (Costa & McCrae, 1992; Piedmont & Weinstein, 
1993; Trull, Useda, Costa, & McCrae, 1995). 

The NEO PI-R is an excellent measure of per- 
sonality that is especially useful in research. 
Rubenzer, Faschingbauer, and Ones (2000) de- 
scribe a particularly fascinating research project 
with the test in which all U.S. presidents were eval- 
uated by 115 highly informed, expert presidential 
biographers who filled out the NEO PI-R on behalf 
of the presidents, from George Washington through 
George H. W. Bush. The authors developed a ty- 
pology of presidents from the data and related 
facets of the test to presidential success (i.e., his- 
torical greatness). They also published individual 
presidential profiles, such as the following results 
for George Washington (50 is average in the gen- 
eral population): 


Neuroticism 47 
Extraversion 44 
Openness 39 
Agreeableness 40 
Conscientiousness 72 


The portrait that emerges is of a leader who is well- 
adjusted, slightly introverted, not particularly open 
to experience, markedly disagreeable, and ex- 
tremely conscientious. After reviewing the specific 
facet scores (see Table 14.4), the authors concluded 
that Washington “falls quite short of the modern 
political commodities of warmth, empathy, and 
open-mindedness.” 

The test also shows promise as a measure of 
clinical psychopathology. For example, Clarkin, 
Hull, Cantor, and Sanderson (1993) found that 
patients diagnosed with borderline personality dis- 
order scored very high on Neuroticism and very 

TABLE 14.4 Domain and Facet (Trait) Scales of the NEO PI-R

Domains                   Facets
Neuroticism               Anxiety                Self-Consciousness
                          Angry Hostility        Impulsiveness
                          Depression             Vulnerability
Extraversion              Warmth                 Activity
                          Gregariousness         Excitement Seeking
                          Assertiveness          Positive Emotions
Openness to Experience    Fantasy                Actions
                          Aesthetics             Ideas
                          Feelings               Values
Agreeableness             Trust                  Compliance
                          Straightforwardness    Modesty
                          Altruism               Tender-Mindedness
Conscientiousness         Competence             Achievement Striving
                          Order                  Self-Discipline
                          Dutifulness            Deliberation





low on Agreeableness, which resonates strongly 
with every clinician’s response to these challeng- 
ing patients. Ranseen, Campbell, and Baer (1998) 
determined that 25 adults with attention deficit dis- 
order scored significantly higher than controls in 
the Neuroticism domain and significantly lower in 
the Conscientiousness domain, demonstrating the 
usefulness of the NEO PI-R in understanding at- 
tention deficit disorders in adulthood. One minor 
concern about the instrument is that it lacks sub- 
stantial validity scales—only three items assess va- 
lidity. The administration of the NEO PI-R assumes 
that subjects are cooperative and reasonably hon- 
est. This is usually a safe assumption in research 
settings but may not hold true in forensic, person- 
nel, or psychiatric settings. 


CRITERION-KEYED INVENTORIES


The final self-report inventories that we will 
review embody a criterion-keyed test development
strategy. In a criterion-keyed approach, test
items are assigned to a particular scale if, and
only if, they discriminate between a well-defined 
criterion group and a relevant control group. For 
example, in devising a self-report scale for de- 
pression, items endorsed by depressed persons 
significantly more (or less) frequently than by 
normal controls would be assigned to the depres- 
sion scale, keyed in the appropriate direction. A 
similar approach might be used to develop scales 
for other constructs of interest to clinicians such 
as schizophrenia, anxiety reaction, and the like. 
Notice that the test developer does not consult any 
theory of schizophrenia, depression, or anxiety 
reaction to determine which items belong on the 
respective scales. The essence of the criterion- 
keyed procedure is, so to speak, to let the items fall 
where they may.⁶
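
The item-selection logic of criterion keying can be sketched in a few lines of code. The example below is a minimal illustration, not the procedure used for any published test: it assumes a two-proportion z test with a cutoff of 1.96 as the discrimination rule, and the function names and the tiny data set are hypothetical.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between two endorsement rates."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # everyone (or no one) endorsed the item in both groups
    return (k1 / n1 - k2 / n2) / se


def key_items(criterion, control, z_crit=1.96):
    """Return {item index: keyed answer} for items that discriminate.

    criterion, control: lists of response rows; each row holds one
    True/False answer per item. No theory of the construct is consulted;
    an item is kept only if the two groups answer it differently.
    """
    keyed = {}
    for i in range(len(criterion[0])):
        k1 = sum(row[i] for row in criterion)
        k2 = sum(row[i] for row in control)
        z = two_proportion_z(k1, len(criterion), k2, len(control))
        if z > z_crit:
            keyed[i] = True    # criterion group answers "true" more often
        elif z < -z_crit:
            keyed[i] = False   # criterion group answers "true" less often
    return keyed


def scale_score(responses, keyed):
    """Number of keyed items answered in the keyed direction."""
    return sum(responses[i] == answer for i, answer in keyed.items())


# Hypothetical data: 20 criterion-group members, 40 controls, 3 items.
criterion = [[j < 16, j < 10, j < 3] for j in range(20)]
control = [[j < 10, j < 20, j < 24] for j in range(40)]
keyed = key_items(criterion, control)
print(keyed)
```

In this made-up example, the first item is keyed true, the third is keyed false, and the second is discarded because both groups endorse it at the same rate.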


6. We are glossing over certain complexities here. Some items 
reflecting general psychopathology might discriminate all the 
contrast groups from the control group. The test developer might 
discard these in favor of items that are differentially discrimi- 
nating for just one contrast group but not the others. 


Minnesota Multiphasic Personality 
Inventory-2 (MMPI-2) 


First published in 1943, the MMPI was a 566-item 
true-false personality inventory designed origi- 
nally as an aid in psychiatric diagnosis (Hathaway 
& McKinley, 1940, 1942, 1943; McKinley &
Hathaway, 1940, 1944; McKinley, Hathaway, & 
Meehl, 1948). The test authors followed a strict 
empirical keying approach in the construction of 
the MMPI scales. The clinical scales were devel- 
oped by contrasting item responses of carefully de- 
fined psychiatric patient groups (average N of 
about 50) with item responses of 724 control sub- 
jects. The result was a remarkable test useful both 
in psychiatric assessment and the description of 
normal personality. Within a few years, the MMPI 
became the most widely used personality test in 
the United States. 

At first the MMPI aged gracefully; what appeared
to be minor flaws were tolerated by practitioners.
But as the MMPI reached middle age, the
need for rejuvenation became increasingly obvious. 
The most serious problem was the original control 
group, which consisted primarily of relatives and 
visitors of medical patients at the University of 
Minnesota Hospital. The narrow choice of control 
subjects, tested mainly in the 1930s, proved to be a 
persistent source of criticism for the MMPI. All of 
the control subjects were white, and most were 
young (average age about 35), married, and from a 
small town or rural area. This was a sample of con- 
venience that was significantly unrepresentative of 
the population at large. 

The item content of the MMPI also raised con- 
cerns (Graham, 1993). Several items used archaic 
and obsolete terminology, referring to “drop the 
handkerchief” (a parlor game from the 1930s), 
sleeping powders (sleeping pills), and streetcars 
(electric-powered buses). Other items used sexist 
language. Examinees found some items objection- 
able, especially those dealing with Christian reli- 
gious beliefs. These items were the source of 
occasional lawsuits alleging invasion of privacy. Fi- 
nally, a few items dealing with bowel functions and 
sexual behavior were just downright offensive. 
From the standpoint of measurement, a more serious
problem with item content was that of omission.
The MMPI item pool was not broad enough to
assess many important characteristics, including 
suicidal tendencies, drug abuse, and treatment-re- 
lated behaviors. An additional motive for MMPI re- 
vision was to extend the range of item coverage. 

The MMPI-2 was released in 1989 after nearly 
a decade of revision and restandardization. The new, 
improved MMPI-2 incorporates a contemporary 
normative sample of 2,600 individuals who are 
loosely representative of the general population on 
major demographic variables (geographic location, 
race, age, occupational level, and income). Although 
higher educational levels are overrepresented, the 
MMPI-2 normative sample is still a vast improve- 
ment over the MMPI normative sample. The item 
pool has been significantly improved by revision of 
obsolete items, deletion of offensive items, and ad- 
dition of new items to extend content coverage. 

The MMPI-2 is a significant improvement upon 
the MMPI, but maintains substantial continuity with 
its esteemed predecessor. The test developers re- 
tained the same titles and measurement objectives 
for the traditional validity and clinical scales. The re- 
standardization provides a better calibration for scale 
elevations, a much-needed improvement (Tellegen 
& Ben-Porath, 1992). Although dozens of items 
were rewritten, most of these revisions are cosmetic 
and do not affect the psychometric characteristics of 
the test (Ben-Porath & Butcher, 1989). In fact, when 
large samples of subjects complete the MMPI and 
the MMPI-2, scores on the individual validity and 
clinical scales typically correlate near .99. 

The MMPI-2 consists of 567 items carefully 
designed to assess a wide range of concerns. The 
examinee is asked to mark “true” or “false” for each 
statement as it applies to himself or herself. Most 
of the items are self-referential. The items encom- 
pass these general themes (Dahlstrom, Welsh, & 
Dahlstrom, 1972; Graham, 1993): 


health concerns
neurologic symptoms
family problems
marital relations
work problems
depressive and manic symptoms
obsessive and compulsive states
delusions and hallucinations
anxiety and phobias
anger control
antisocial practices
alcohol and drug abuse
self-esteem
social discomfort
Type A behavior
masculinity-femininity
negative treatment indicators
response validity, for example, improbable virtues

The MMPI requires a sixth-grade reading level and
is completed by most persons in 1 to 1½ hours.

The original MMPI scales were developed by 
contrasting item responses of carefully defined psy- 
chiatric patient groups (average N of about 50) with 
item responses of about 700 controls. The psychi- 
atric patient groups included the following diag- 
nostic categories: hypochondriasis, depression, 
hysteria, psychopathy, male homosexuality, paranoia,
psychasthenia,⁷ schizophrenia, and the early
phase of mania (hypomania). In addition, samples 
of socially introverted and socially extraverted col- 
lege students were used to construct a scale for so- 
cial introversion. The MMPI-2 retains the basic 
clinical scales with only minor item deletions and 
revisions. Ben-Porath and Butcher (1989) investi- 
gated the characteristics of the rewritten items on 
the MMPI-2 and discovered that they are psycho- 
metrically equivalent to the original items. 

The MMPI-2 can be scored for four validity 
scales, 10 standard clinical scales, and dozens of 
supplementary scales. In practice, clinicians place 
the greatest emphasis upon the validity and stan- 
dard clinical scales. The supplementary scales are 
just that—supplementary. They provide informa- 
tion helpful in fine-tuning the interpretation of the 
traditional validity and clinical scales. MMPI-2 


7. This outdated diagnostic term is quite similar to what would 
now be labeled obsessive-compulsive disorder. 


scale raw scores are converted to T scores, with a 
mean of 50 and a standard deviation of 10. Scores 
that exceed T of 65 merit special consideration. 
These elevated scores are statistically uncommon 
in the general population and may signify the pres- 
ence of psychiatric symptomatology. We will con- 
centrate upon the traditional scales here, beginning 
with a review of the four validity scales, known as 
Cannot Say (or ?), L, F, and K. 
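
The T-score metric described above can be illustrated with the textbook linear standard-score formula. The sketch below is only that: the normative mean and standard deviation are hypothetical, the function names are invented for illustration, and the published MMPI-2 conversion tables are not reproduced here and may not be strictly linear.

```python
def linear_t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, standard deviation 10)."""
    return 50.0 + 10.0 * (raw - norm_mean) / norm_sd


def merits_special_consideration(t_score, cutoff=65.0):
    """True when the T score exceeds the cutoff cited in the text."""
    return t_score > cutoff


# Hypothetical norms for one scale: raw-score mean of 20, SD of 5.
t = linear_t_score(raw=28, norm_mean=20, norm_sd=5)
print(t, merits_special_consideration(t))
```

With these made-up norms, a raw score of 28 lies 1.6 standard deviations above the mean, which corresponds to a T score of 66 and therefore exceeds the cutoff of 65.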

The Cannot Say score is simply the total num- 
ber of items omitted or double-marked in comple- 
tion of the answer sheet. The instructions for the 
test encourage examinees to mark all items, but 
omissions or double-marked items will occur. 
However, this is rare—the modal number of items 
omitted is zero (Tamkin & Scherer, 1957). Omis- 
sion of up to 10 items appears to have little effect 
on the overall test results—one of the benefits of 
having a huge pool of statements in the MMPI-2. 
A very high score on this scale may indicate a read- 
ing problem, opposition to authority, defensive- 
ness, or indecisiveness caused by depression. 
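
Because the Cannot Say score is a simple count, it is easy to express in code. The sketch below assumes a hypothetical answer-sheet representation in which each item is recorded as the set of options the examinee marked. The threshold of 10 comes from the paragraph above.

```python
def cannot_say_score(answer_sheet):
    """Count items that were omitted (no mark) or double-marked.

    answer_sheet: list with one set per item, holding the marked options,
    e.g. {"T"}, {"F"}, set() for an omission, {"T", "F"} for a double mark.
    """
    return sum(1 for marks in answer_sheet if len(marks) != 1)


# Hypothetical five-item excerpt: one omission and one double-marked item.
sheet = [{"T"}, {"F"}, set(), {"T", "F"}, {"T"}]
score = cannot_say_score(sheet)
exceeds_tolerance = score > 10  # up to 10 omissions appear to have little effect
print(score, exceeds_tolerance)
```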

The L Scale is composed of 15 items all scored 
in the false direction. By answering “false” to L 
Scale items, the examinee asserts that he or she pos- 
sesses a degree of personal virtue that is rarely ob- 
served in our culture (e.g., never gets angry, likes 
everyone, never lies, reads every newspaper edito- 
rial, and would rather lose than win). The L Scale 
was designed to identify a general, deliberate, eva- 
sive test-taking attitude. A high score on the L Scale 
indicates that the examinee is not only defensive, 
but naively so. Persons with any degree of psycho- 
logical sophistication can adopt a defensive test- 
taking attitude and still score in the normal range 
on the L Scale. 

The F Scale consists of 60 items answered by 
normal subjects in the scored direction no more 
than 10 percent of the time. These items reflect a 
broad spectrum of serious maladjustment, includ- 
ing peculiar thoughts, apathy, and social alienation. 
Even though F Scale items seem to indicate psy- 
chiatric pathology, they are seldom endorsed by pa- 
tients. Fewer than 50 percent of these items appear 
on the clinical scales. Many persons with significant
psychiatric disturbance do produce elevated
scores in the range of T = 70 or 80 on the F Scale.
On the other hand, exceptionally high scores sug- 
gest additional hypotheses: insufficient reading 
ability, random or uncooperative responding, a mo- 
tivated attempt to “fake bad” on the test, or an ex- 
aggerated “cry for help” in a distressed client. 

The K Scale was designed to help detect a sub- 
tle form of defensiveness. The 30-item scale is 
composed, in part, of 22 items that differentiated 
normal profiles produced by defensive hospitalized 
psychiatric patients from those produced by normal 
controls. Additionally, eight items that improved 
discrimination of depressive and schizophrenic 
symptoms were added (McKinley, Hathaway & 
Meehl, 1948). An elevated score on the K Scale 
may indicate a defensive test-taking attitude. Nor- 
mal range elevations on the K Scale suggest good 
ego strength—the presence of useful psychological 
defenses that allow the person to function well in 
spite of internal conflict. 

The combined use of F and K may be useful in 
the detection of MMPI-2 profiles that have been 
faked or malingered. In one study, 81 percent of 
fake-good profiles were identified by a simple de- 
cision rule (using raw scores) of F-K < -12, 
whereas 87 percent of fake-bad profiles were iden- 
tified by a simple decision rule (using raw scores) 
of F-K > 7 (Bagby, Rogers, Buis, & Kalemba, 
1994). Studies in which subjects are coached on 
how to fake the MMPI-2 raise troubling ethical 
concerns (Ben-Porath, 1994; Wetter, Baer, Berry, 
& Reynolds, 1994). 
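
The F-K decision rules reported by Bagby et al. (1994) can be written as a short function. This is a sketch of the two cutoffs quoted above, applied to raw scores. The function name and the example scores are hypothetical, and the cutoffs come from a single study, so they should not be read as universal standards.

```python
def f_minus_k_flag(f_raw, k_raw, fake_good_cut=-12, fake_bad_cut=7):
    """Apply the raw-score F-K decision rules.

    Returns the F-K index and a descriptive flag:
    an index below -12 suggests a fake-good profile,
    an index above 7 suggests a fake-bad profile.
    """
    index = f_raw - k_raw
    if index < fake_good_cut:
        return index, "possible fake-good"
    if index > fake_bad_cut:
        return index, "possible fake-bad"
    return index, "no flag"


# Hypothetical raw scores on the 60-item F Scale and the 30-item K Scale.
print(f_minus_k_flag(3, 20))   # low F, high K
print(f_minus_k_flag(25, 10))  # high F, low K
print(f_minus_k_flag(8, 12))   # unremarkable
```

A flag of this kind is a hypothesis for the examiner to evaluate, not a verdict, because the same study left a portion of faked profiles undetected.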

Several clinical scales are “K-corrected” to im- 
prove their discriminatory power. The rationale for 
this practice is that elevations on K betoken an ar- 
tificial reduction of scores on these clinical scales. 
Portions of the raw score on K are thus added to 
these clinical scale scores prior to computation of 
the T scores. The K-corrected scales, discussed 
later, include Hypochondriasis, Psychopathic De- 
viate, Psychasthenia, Schizophrenia, and Hypoma- 
nia. Whether K correction actually improves the 
MMPI-2 is debatable, but the test publishers con- 
tinued the tradition from the MMPI for the sake of 
continuity. Separate norms for non-K-corrected 
scale score transformations are also available. 
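
The arithmetic of K correction is straightforward: a fixed fraction of the K raw score is added to the raw score of each corrected scale. The sketch below uses the K fractions listed in Table 14.5 (.5 for Hs, .4 for Pd, 1 for Pt and Sc, .2 for Ma). The function name and example scores are hypothetical, and the products are left unrounded because the text does not describe a rounding convention.

```python
# Fraction of the K raw score added to each K-corrected clinical scale.
K_WEIGHTS = {"Hs": 0.5, "Pd": 0.4, "Pt": 1.0, "Sc": 1.0, "Ma": 0.2}


def k_correct(raw_scores, k_raw):
    """Return clinical raw scores with the K correction added.

    Scales without an entry in K_WEIGHTS are returned unchanged.
    """
    return {
        scale: raw + K_WEIGHTS.get(scale, 0.0) * k_raw
        for scale, raw in raw_scores.items()
    }


# Hypothetical raw scores for six clinical scales and a K raw score of 15.
raw = {"Hs": 8, "D": 20, "Pd": 18, "Pt": 12, "Sc": 10, "Ma": 17}
corrected = k_correct(raw, k_raw=15)
print(corrected)
```

In this example the Depression score is untouched, whereas Psychasthenia rises by the full 15 points. This shows how a defensive examinee with a high K score has several clinical scales adjusted upward before T scores are computed.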

In addition to the validity scales, the MMPI-2
is always scored for 10 clinical scales. With the exception
of Social Introversion, these clinical scales
were constructed in the usual criterion-keyed man- 
ner by contrasting responses of clinical subjects 
and normal controls. As noted previously, Social 
Introversion was developed by contrasting the re- 
sponses of college students high and low in social 
introversion. The 10 clinical scales and common in- 
terpretations of elevated scores are outlined in 
Table 14.5. 

Dozens of supplementary scales can also be 
scored on the MMPI-2. Some of the supplementary 
scales are based upon rational identification of 
symptom clusters and subsequent scale purification 
by empirical means. Fifteen useful MMPI-2 Con- 
tent Scales were developed in this manner (Butcher, 
Graham, Williams, & Ben-Porath, 1990). Many of 
the supplementary scales were developed by inde- 
pendent investigators; these scales vary widely in 
quality. In practice, only about 30 of the additional 
scales are routinely scored. Examples of the sup- 
plementary scales include Anxiety, Repression, 
Ego Strength, and the MacAndrew Alcoholism 
Scale-Revised. Anxiety (A) and Repression (R) are 
the first two major factors that always emerge from 
factor analysis of MMPI-2 responses. An interest- 
ing supplementary scale is Barron’s (1953) Ego 
Strength (Es) Scale, which purports to predict pos- 
itive response to psychotherapy. However, not all 
studies confirm this use of the scale (Graham, 
1987). The MacAndrew Alcoholism Scale-Revised 
(MAC-R; MacAndrew, 1965) is a useful index of 
alcohol or other substance abuse. The MAC-R is 
not only useful in assessment of alcoholism but is 
also helpful in the identification of heavy drinkers 
and drug-dependent individuals (Wolf, Schubert, 
Patterson, Grande, & Pendleton, 1990). We cannot 
possibly review all the useful supplementary scales 
here. The interested reader should consult Butcher 
and Williams (1992) and Graham (1993). 


MMPI-2 Interpretation 


The interpretation of an MMPI-2 profile can pro- 
ceed along two different paths: scale by scale or 

TABLE 14.5 The 10 Clinical Scales from the Minnesota Multiphasic Personality
Inventory-2

Scale No. and                                               Typical Interpretation
Abbreviation    Scale Name                K Correction      of Elevation
1  Hs           Hypochondriasis           .5K               Excessive physical preoccupation
2  D            Depression                                  Sad feelings, hopelessness
3  Hy           Hysteria                                    Immaturity, use of repression, denial
4  Pd           Psychopathic deviate      .4K               Authority conflict, impulsivity
5  Mf           Masculinity-femininity                      Masculine interests [women],
                                                            Feminine interests [men]
6  Pa           Paranoia                                    Suspiciousness, hostility
7  Pt           Psychasthenia             1K                Anxiety and obsessive thinking
8  Sc           Schizophrenia             1K                Alienation, unusual thought processes
9  Ma           Hypomania                 .2K               High energy, possible agitation
0  Si           Social introversion                         Shyness and introversion





configural. In the simplest possible approach, scale 
by scale, the examiner determines the validity of the 
test, as discussed previously, by inspecting the four 
validity scales. If the test appears reasonably valid 
by these criteria, the examiner consults a relevant 
resource book and proceeds scale by scale to pro- 
duce a series of hypotheses. For example, Lachar 
(1974) has distilled the meaning of various eleva- 
tions on the Pa or Paranoia scale as follows: 


T = 27-44    examinee may be stubborn, touchy, or difficult
T = 45-59    no undue sensitivity and adequate regard for others
T = 60-69    increasing probability of rigidity and oversensitivity
T = 70-79    rigid, touchy, projects blame and hostility
T = 79-100   frankly delusional paranoid features may be present


The configural approach to MMPI-2 interpretation
is somewhat more complicated and consists of
classifying the profile as belonging to one or another
loosely defined code type that has been studied ex- 
tensively. Code types are usually defined by a com- 
bination of elevation (two or more clinical scales 
elevated beyond a certain criterion) and definition 
(two or more clinical scales clearly standing out 
from the others). For example, in its full-blown 
manifestation, the 4-9 code type can be defined by 
a valid profile in which scale 4 (Psychopathic De- 
viate) and scale 9 (Hypomania) are the high-point 
elevations, both exceed T of 65 (elevation), and both 
exceed the next highest clinical scale by at least 5 T- 
score points (definition). Here is how Graham 
(1993) describes persons who fit this code type: 


The most salient characteristics of 49/94 individu- 
als is a marked disregard for social standards and 
values. They frequently get in trouble with the au- 
thorities because of antisocial behavior. They have 
a poorly developed conscience, easy morals, and 
fluctuating ethical values. Alcoholism, fighting, 
marital problems, sexual acting out, and a wide 
array of delinquent acts are among the difficulties
in which they may be involved. This is a common
code type among persons who abuse alcohol and 
other substances. 


The most likely diagnosis for such individuals is 
antisocial personality disorder. 
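
The elevation and definition criteria for a two-point code type, as described above for the 4-9 example, can be sketched as a simple check. The function name and the example profile are hypothetical. The cutoffs (T above 65, and at least 5 T-score points above the next highest scale) are the ones given in the text, and profile validity is assumed to have been established separately from the validity scales.

```python
def two_point_code_type(t_scores, elevation_cut=65, definition_gap=5):
    """Return a code such as "4-9" when the two highest clinical scales
    meet the elevation and definition criteria, otherwise None.

    t_scores: dict mapping clinical scale number (as a string) to T score.
    """
    ranked = sorted(t_scores.items(), key=lambda pair: pair[1], reverse=True)
    (first, t1), (second, t2), (_, t3) = ranked[0], ranked[1], ranked[2]
    # Elevation: both high-point scales exceed the cutoff.
    elevated = t1 > elevation_cut and t2 > elevation_cut
    # Definition: the lower of the two still stands clear of the third scale.
    well_defined = (t2 - t3) >= definition_gap
    return f"{first}-{second}" if elevated and well_defined else None


# Hypothetical profile with scales 4 and 9 as the high points.
profile = {"1": 52, "2": 58, "3": 50, "4": 78, "5": 48,
           "6": 60, "7": 55, "8": 62, "9": 74, "0": 45}
print(two_point_code_type(profile))
```

The code is reported with the higher scale first, so a profile in which scale 9 exceeded scale 4 would be labeled 9-4. The interpretive literature usually treats the two orderings together, as in Graham's "49/94" notation.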

We should mention briefly that several comput- 
erized interpretation systems are available for the 
MMPI and the MMPI-2 (Fowler, 1985; Butcher, 
1987). The Minnesota Report™ (Butcher, 1993) is 
the best. This system generates a very cautious and 
methodical 16-page report that includes discussion 
of profile validity, symptomatic patterns, interper- 
sonal relations, diagnostic considerations, and treat- 
ment considerations. The Minnesota Report™ also 
provides a variety of figures and tables to illustrate 
test results. 

The adequacy of computerized MMPI-2 narra- 
tive reports is generally good, but the reader should 
realize that computer programs are written by fal- 
lible human beings. There is a danger that com- 
puter-generated test reports will be erroneous. 
Furthermore, some less-reputable interpretive sys- 
tems can be purchased on microcomputer diskette 
for a few hundred dollars. This increases the risk 
that computer-based test interpretations will be 
misused by unqualified persons. We discuss the pit- 
falls of computerized test interpretation in the final 
chapter of the book. 


Technical Properties of the MMPI-2 


From the standpoint of traditional psychometric 
criteria, the MMPI-2 presents a mixed picture. Re- 
liability data are generally positive, with median in- 
ternal consistency coefficients (alpha) typically in 
the .70s and .80s, but as low as the .30s for some 
scales in some samples. One-week test-retest coef- 
ficients range from the high .50s to the low .90s, 
with a median in the .80s (Butcher, Dahlstrom, 
Graham, Tellegen, & Kaemmer, 1989). These are 
good figures considering that some attributes— 
such as those measured by the Depression scale— 
change so quickly that the test-retest methodology 
is of questionable suitability. 

A shortcoming of the MMPI-2 is that intercor- 
relations among the clinical scales are extremely 
high. For example, in the case of scales 7 and 8, the 
Psychasthenia and Schizophrenia scales, the correlation
is commonly in the .70s. In part, this reflects
the item overlap between MMPI scales—scales 7 
and 8 share 17 items in common. But it is also true 
that the criterion-keyed approach is not well suited 
to the development of independent measures. A 
high intercorrelation of basic scales is one price to 
be paid for using this test development strategy. 

The validity of the MMPI-2 is difficult to sum- 
marize, owing to the sheer volume of research on 
this instrument and its predecessor, the MMPI. As 
of 1975, over 6,000 studies employing the MMPI 
had been completed (Dahlstrom, Welsh, & 
Dahlstrom, 1975). Of course, thousands of addi- 
tional studies have been published since then. Gra- 
ham (1993) provides a brief but excellent review of 
validity studies on the MMPI/MMPI-2. He notes 
that the average validity coefficient for MMPI stud- 
ies conducted between 1970 and 1981 was a 
healthy .46. He also points out the confirming pat- 
tern of extratest correlates in dozens of studies of 
identified patient groups. Research also indicates 
that the MMPI-2 is highly comparable to the 
MMPI, for which a substantial body of validity data 
has been compiled (Hargrave, Hiatt, Ogard, & Karr, 
1994; Harrell, Honaker, & Parnell, 1992). Finally, 
bias studies comparing MMPI-2 results for Caucasian
and African American clients indicate that
slight racial differences do exist in average profiles. 
However, these differences validly reflect emo- 
tional functioning; that is, the MMPI-2 is not 
racially biased (McNulty, Graham, Ben-Porath, & 
Stein, 1997). 

The MMPI/MMPI-2 has been shown to be of 
value for a wide range of diagnostic and treatment 
problems, including the assessment of antisocial, 
borderline, and narcissistic personality disorders 
(Castlebury, Hilsenroth, Handler, & Durham, 
1997), the evaluation of sexual abuse history and 
sexual orientation in women (Griffith, Myers, 
Cusick, & Tankersley, 1997), the prediction of out- 
come from surgery for back pain (Uomoto, Turner, 
& Herron, 1988), the empirical classification of 
homicide offenders (Kalichman, 1988), the treat- 
ment of individuals with HIV and AIDS (Inman, 
Esther, Robertson, Hall, & Robertson, 2002), and 
the treatment of criminal offenders (Forbey & 
Ben-Porath, 2002), to cite just a few examples. The
MMPI-2 likely will maintain its status as the pre- 
miere instrument for assessment of psychopathol- 
ogy in adulthood for many years to come. 


California Psychological Inventory (CPI) 


Originally published in 1957, the recently revised 
CPI is an MMPI-like instrument designed expressly 
to measure the dimensions of normal personality 
(Gough, 1987; McAllister, 1986). The test consists 
of 462 true-false items, including nearly 200 items 
borrowed directly from the MMPI. The revised CPI 
yields scores on 20 scales, including three measures 
designed to assess test-taking attitudes. These va- 
lidity scales are Sense of well-being (Wb), derived 
from normals asked to “fake bad”; Good impression 
(Gi), derived from normals asked to “fake good”; 
and Communality (Cm), which consists of items en- 
dorsed by 95 percent of normals. Thus, Wb, Gi, and 
Cm are designed, respectively, to detect subjects 
who fake bad, fake good, or respond randomly. 

The clinical scales are based on “folk” concepts 
of personality; they measure dimensions of per- 
sonality that are meaningful and easily recognized 
by laypersons and psychologists alike. For 13 of the 
17 clinical scales, Gough used a criterion-keyed ap- 
proach in test development. Extreme groups of sub- 
jects (mainly college students) were formed on 
such scale-relevant criteria as school grades, socia- 
bility, and participation in extracurricular activities. 
Item endorsement frequencies were then contrasted 
to ferret out the best statements for each scale. The 
remaining four scales were constructed on a ratio- 
nal basis backed up by indices of internal consis- 
tency. The CPI scales are described in Table 14.6. 

Reflecting the care with which the scales were 
constructed, reliability data for the CPI are quite 
satisfactory. Most test-retest scale correlations are 
in the .80s, with a range from about the .50s to the 
low .90s, depending upon the sample and the time 
interval between administrations. The CPI is also 
well standardized; scores are based on a normative 
sample of 6,000 males and 7,000 females of widely 
varying age, social class, and geographic region. 
All scale scores are reported as T scores, with a 
mean of 50 and standard deviation of 10. 
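
As a rough illustration of the T-score metric, the sketch below converts a raw scale score to a linear T score from the mean and standard deviation of a normative sample. The function name and the normative values in the example are hypothetical and are not taken from the CPI manual; published inventories may also apply normalizing transformations that this linear sketch ignores.

```python
def linear_t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a linear T score (mean 50, SD 10)."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # Express the raw score in standard-deviation units (z score).
    z = (raw - norm_mean) / norm_sd
    # Rescale so the normative mean maps to 50 and one SD maps to 10 points.
    return 50 + 10 * z


# Hypothetical norms: mean raw score of 20, standard deviation of 4.
print(linear_t_score(28, 20, 4))  # two SDs above the mean; prints 70.0
```

Under these assumed norms, a raw score of 28 falls two standard deviations above the mean and therefore corresponds to a T score of 70.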


The CPI scales were cross-validated to deter- 
mine their ability to discriminate new samples of 
subjects rated on the relevant dimension. These va- 
lidity coefficients are quite mixed; some values are 
acceptable, but other criterion correlations are very 
low. For example, scores on the Flexibility scale 
typically correlate only about .2 with staff ratings 
of flexibility (Domino, 1987). 

The CPI is heir to a long history of empirical re- 
search that substantiates a number of real-world 
correlates for distinctive test profiles. Due to space 
limitations, we can only list several prominent 
areas in which the value of the test has been em- 
pirically confirmed. The CPI is useful for helping 
predict the following: 


• high school and college achievement
• effectiveness of student-teachers
• grade point average in medical school
• effectiveness of police and military personnel
• leadership and executive success


The CPI is particularly effective at identifying ado- 
lescents or adults who follow a delinquent or crim- 
inal life style (Gough & Bradley, 1992b). The 
reader can find further details on the real-world 
empirical correlates of CPI profiles in Groth- 
Marnat (1990) and Hargrave and Hiatt (1989). 


Millon Clinical Multiaxial Inventory-III (MCMI-III)


The MCMI-III is a personality inventory designed
for the same purposes as the MMPI-2, namely, to
provide useful information for psychiatric diagnosis
(Millon, 1983, 1987, 1994). The MCMI-III has
two advantages over the MMPI-2. First, it is much
shorter (175 true-false items) and therefore more 
palatable to clinical referrals; second, it is planned 
and organized to identify clinical patterns in a man- 
ner that is compatible with the Diagnostic and 
Statistical Manual (DSM-IV) of the American Psy- 
chiatric Association. 

The MCMI-III is a highly theory-driven test,
incorporating Millon’s elaborate theoretical formu- 
lations on the nature of psychopathology and per- 
sonality disorder (Millon, 1969, 1981, 1986; Millon 
& Davis, 1996). The test includes 27 scales, listed in 
Table 14.7. The first 11 scales measure personality 


TABLE 14.6 Brief Description of California Psychological Inventory Scales

Scale                            Common Interpretation of High Score

Dominance                        dominant, persistent, good leadership ability
Capacity for Status              personal qualities that underlie and lead to status
Sociability                      outgoing, sociable, participative temperament
Social Presence                  poise, spontaneity, and self-confidence in social situations
Self-Acceptance                  self-acceptance and sense of personal worth
Independence                     high sense of personal independence, not easily influenced
Sense of Well-being              not worrying or complaining, free from self-doubt
Empathy                          good capacity to empathize with other persons
Responsibility                   conscientious, responsible, and dependable
Socialization                    strong social maturity and high integrity
Self-Control                     good self-control, freedom from impulsivity and self-centeredness
Tolerance                        permissive, accepting, and nonjudgmental social beliefs
Good Impression                  concerned about creating a good impression
Communality                      valid and thoughtful response pattern
Achievement via Conformance      achieves well in settings where conformance is necessary
Achievement via Independence     achieves well in settings where independence is necessary
Intellectual Efficiency          high degree of personal and intellectual efficiency
Psychological-Mindedness         interested in and responsive to the inner needs, motives, and experiences of others
Flexibility                      flexible and adaptable in thought and social behavior
Femininity                       high degree of feminine interests

Source: Based on Gough, H. G. (1987). California Psychological Inventory manual. Palo Alto, CA: Consulting Psychologists Press. Also, Megargee, E. (1972). The California Psychological Inventory handbook. San Francisco: Jossey-Bass.


TABLE 14.7 Scales of the Millon Clinical Multiaxial Inventory-III

Clinical Personality Patterns
1     Schizoid
2A    Avoidant
2B    Depressive
3     Dependent
4     Histrionic
5     Narcissistic
6A    Antisocial
6B    Aggressive (Sadistic)
7     Compulsive
8A    Passive-Aggressive (Negativistic)
8B    Self-Defeating

Severe Personality Pathology
S     Schizotypal
C     Borderline
P     Paranoid

Clinical Syndromes
A     Anxiety
H     Somatoform
N     Bipolar: Manic
D     Dysthymia
B     Alcohol Dependence
T     Drug Dependence
R     Post-Traumatic Stress Disorder

Severe Syndromes
SS    Thought Disorder
CC    Major Depression
PP    Delusional Disorder

Validity (Modifying) Indices
X     Disclosure
Y     Desirability
Z     Debasement

styles or traits such as narcissism and antisocial tendencies;
the next three assess more severe personality
pathology (schizotypal, borderline, and
paranoid disorders); the following seven scales as- 
sess clinical syndromes such as anxiety and de- 
pression; the next three scales assess severe clinical 
syndromes such as thought disorder; the last three 
scales are validity (response style) indices. Scores 
on these scales (Disclosure, Desirability, and De- 
basement) are used to adjust the other scale scores 
upward or downward, based on defensiveness or 
exaggeration of symptoms, respectively. 

Scale development for the MCMI-III and its 
precursors was careful and methodical. We can 
only portray the broad outline here, in which 3,500 
initial items were culled to 175 statements in three 
stages of test development: a theoretical-substan- 
tive stage (theory-guided item writing), an internal- 
structural stage (item-scale correlations), and an 
external-criterion stage (contrast of diagnostic 
groups with the reference group). A special feature 
of the last stage was Millon’s use of general psy- 
chiatric patients instead of normal controls as the 
reference group. The purpose of this strategy was 
to enhance the capacity of MCMI scales to differ- 
entiate specific diagnostic groups from one another. 
Unfortunately, one side effect of this particular cri- 
terion-keyed approach was a rather substantial de- 
gree of item overlap for the clinical scales. Millon 
planned for and expected the item overlap, but 
probably did not anticipate that some pairs of scales 
on the MCMI would share the majority of their 
items in common. Some of this overlap was elimi- 
nated with the further refinement of the test for the 
second and third editions. The revised instrument 
also incorporates an item-weighting procedure. In 
this approach, individual questions are weighted 2 
or 1 to reflect their importance in discriminating the 
prototype for each scale. The item-weighting ap- 
proach has been criticized as unnecessary and un- 
wieldy (Streiner, Goldberg, & Miller, 1993). 
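
The item-weighting idea can be sketched in a few lines. In the hedged example below, a raw scale score is the sum of the weights (2 or 1) of the items answered in the keyed direction. The item numbers, weights, and function name are invented for illustration and do not reproduce the actual MCMI scoring key; any later conversion of raw scores is not shown.

```python
def weighted_raw_score(responses, item_weights):
    """Sum the weights of the scale items answered in the keyed direction.

    responses: dict mapping item number -> True if answered in the keyed direction
    item_weights: dict mapping item number -> weight (1 or 2) for one scale
    """
    total = 0
    for item, weight in item_weights.items():
        if weight not in (1, 2):
            raise ValueError(f"item {item}: weight must be 1 or 2")
        # Unanswered items are treated as not endorsed.
        if responses.get(item, False):
            total += weight
    return total


# Hypothetical scale with two prototypic items (weight 2) and two others (weight 1).
weights = {3: 2, 17: 1, 42: 2, 88: 1}
answers = {3: True, 17: True, 42: False, 88: True}
print(weighted_raw_score(answers, weights))  # prints 4
```

With unit weights the same response pattern would yield a raw score of 3, so the weighting changes the result only when prototypic items are endorsed.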

The normative sample for the MCMI-III consisted
of about a thousand men and women patients
from across the United States. This is an unusual
and controversial approach to the collection of
a normative sample. More typically, population-proportionate
sampling of reasonably normal individuals
is used. Millon offers the arguable justification
that a patient sample is adequate for the normative
sample because the base rates (in the general
population) for specific personality and clinical dis- 
orders were consulted to calibrate the cutting points 
on the individual scales (Millon & Davis, 1996). 
But this approach is complex, experimental, and 
difficult to understand. The reliability of the indi- 
vidual scales is good: Internal consistency coeffi- 
cients average .82 to .90, and test-retest coefficients 
for one week range from .81 to .87. Support for the 
validity of the MCMI-III is mixed (Haladyna, 1992;
Piersma & Boes, 1997). 

The MCMI-III is unlikely to replace the
MMPI-2, which is better suited to the diagnosis of
acute clinical syndromes. However, the MCMI-III
shows promise in the diagnosis of personality disorders
and therefore can supplement the MMPI-2
in this capacity (Antoni, 1993). Nonetheless, several
recent independent studies have called into
question the value of the MCMI-III and especially
its precursors in determining diagnoses within the 
framework of the Diagnostic and Statistical Man- 
ual of Mental Disorders (Smith, Carroll, & Fuller, 
1988; Patrick, 1988; McCann & Suess, 1988). 
Also, Choca, Shanley, Peterson, and Van Denburg 
(1990) found evidence of racial bias against 
African Americans on the MCMI. Morgan, Shoen- 
berg, Dorr, and Burke (2002) raise concerns that the 
Disclosure index of the MCMI-III (designed to de- 
tect overreporting of symptoms) is not sensitive to 
clients who grossly exaggerate their problems. 
Reynolds (1992) captures what must be the shared 
opinion of many clinicians when he describes the 
MCMI as “a conceptual gem and psychometrically 
somewhere between a nightmare and an enigma.” 
For other reviews, see Dana and Cantrell (1988) 
and Overholser (1990). Craig (1993) has assembled 
a series of articles that are largely supportive of the 
MCMI. Jankowski (2002) provides a beginner’s 
guide to the test. 


Personality Inventory for Children-2 (PIC-2) 


The PIC-2 (Lachar & Gruber, 2001) is a substan- 
tial revision of the PIC-R, a popular instrument that 
dates back to the late 1950s (Wirt & Broen, 1958; 


Wirt, Lachar, Klinedinst, & Seat, 1984). The cur- 
rent version, suitable for children 5 through 19 
years of age, consists of 275 true-false statements 
that are completed by a parent or parental surro- 
gate. The PIC-2 is one corner of a triad of instru- 
ments developed by David Lachar and colleagues 
to provide a comprehensive, multiview perspective 
on children’s emotional and behavioral adjustment 
in the home, school, and community. The comple- 
mentary instruments are the Personality Inventory 
for Youth (PIY), which is filled out by the child, and
the Student Behavior Survey (SBS), which is filled 
out by the teacher. We discuss only the PIC-2 here. 
Items on the PIC-2 resemble the following: 


My child finds it difficult to fall asleep. 

My child is a finicky eater. 

My child has threatened to kill himself (herself). 
Sometimes my child swears at other adults. 
Our marriage has been full of turmoil. 


The instrument also provides a shorter 96-item ver- 
sion known as the Behavioral Summary, suitable 
for screening and research purposes. 

The test developers of the PIC-2 followed a 
complex multistage methodology to assign individ- 
ual items to scales and subscales. The goal was to 
minimize content overlap between scales and subscales
by examining preliminary item × subscale
correlations and then retaining only those items for 
each specific subscale that showed high correla- 
tions. As a consequence of this test development 
strategy, each subscale possesses homogeneous 
content and the individual statements correlate sub- 
stantially with one another. The resulting instrument 
consists of three response validity scales (Incon- 
sistency, Dissimulation, Defensiveness) and nine 
adjustment scales. Each of the adjustment scales 
includes two or three subscales (Table 14.8). 

Scale raw scores are converted to T scores with
a mean of 50 and standard deviation of 10. Higher
T scores indicate increased probability of psychopathology
or deficit. Norms for children 5 through 19 years
of age are based on a nationally
representative sample of 2,306 parents of boys and
girls in kindergarten through twelfth grade.

TABLE 14.8 Adjustment Scales and Subscales
of the Personality Inventory for Children-2

Adjustment Scales                  Subscales

Cognitive Impairment               Inadequate Abilities
                                   Poor Achievement
                                   Developmental Delay
Impulsivity and Distractibility    Disruptive Behavior
                                   Fearlessness
Delinquency                        Antisocial Behavior
                                   Dyscontrol
                                   Noncompliance
Family Dysfunction                 Conflict among Members
                                   Parent Maladjustment
Reality Distortion                 Developmental Deviation
                                   Hallucinations and Delusions
Somatic Concern                    Psychosomatic Preoccupation
                                   Muscular Tension and Anxiety
Psychological Discomfort           Fear and Worry
                                   Depression
                                   Sleep Disturbance/Death Preoccupation
Social Withdrawal                  Social Introversion
                                   Isolation
Social Skills Deficits             Limited Peer Status
                                   Conflict with Peers

With the possible exception of the three validity
scales (Inconsistency, Dissimulation, and Defensiveness),
the PIC-2 scale and subscale names
are self-explanatory. The validity scales are (1) Inconsistency,
which includes 35 similar pairs of
items to determine consistency of responding;
(2) Dissimulation, a 35-item scale designed to identify
deliberate exaggeration (fake bad) about symptoms
or random responding; and (3) Defensiveness,
a 24-item scale consisting of improbable virtues
(e.g., “my child never has any problems”) and
therefore an index of naive defensiveness.
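
The logic of an inconsistency scale built from pairs of similar items can be sketched as follows. This is only an illustration under stated assumptions: the item numbers are invented, both items in each pair are assumed to be keyed in the same direction, and the actual PIC-2 scoring rules (which the text does not describe) may differ.

```python
def inconsistency_count(responses, item_pairs):
    """Count the pairs of similar items that received different answers.

    responses: dict mapping item number -> True/False answer
    item_pairs: list of (item_a, item_b) tuples with similar content
    """
    # A pair is scored as inconsistent when its two answers disagree.
    return sum(1 for a, b in item_pairs if responses[a] != responses[b])


# Three hypothetical item pairs; only the second pair is answered inconsistently.
pairs = [(1, 40), (7, 88), (12, 63)]
answers = {1: True, 40: True, 7: False, 88: True, 12: False, 63: False}
print(inconsistency_count(answers, pairs))  # prints 1
```

A higher count suggests careless or random responding, since a respondent who reads the items attentively should answer similar statements in the same way.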

The reliability of PIC-2 scales and subscales is
good, with test-retest values in the range of .82 to
.92 and internal consistency coefficients in the range
of .81 to .92. The test manual (Lachar & Gruber,
2001) summarizes a huge body of criterion-related
validity studies such as correlations with indepen- 
dent ratings from clinicians. These correlations are 
very strong for similar behavioral dimensions (and 
weak for dissimilar behavioral dimensions), thus 
supporting the validity of individual scales and sub- 
scales. In like manner, PIC-2 subscale scores show 
theory-consistent relationships with the DSM-IV 
diagnostic categories of clinic-referred children. 
For example, 63 children independently diagnosed 
with Oppositional Defiant Disorder showed highly 


elevated scores (average T scores of 75 to 80) on the 
following PIC-2 subscales: Disruptive Behavior, 
Fearlessness, Dyscontrol, and Noncompliance. 
This is a perfect match to the major clinical features 
of this DSM-IV diagnostic category. Overall, the 
test developers have cited an impressive body of re- 
search that supports the reliability and validity of 
their instrument. Although independent studies of 
this test are yet to be published, it seems clear that
the PIC-2 will earn wide usage in the behavioral
and emotional assessment of school-aged children.


SUMMARY 


1. Theory-guided self-report inventories rely 
upon explicit personality theories for their devel- 
opment. A good example of a theory-guided inven- 
tory is the Edwards Personal Preference Schedule 
(EPPS), a 210-item forced-choice instrument that 
attempts to measure Murray’s manifest needs by 
self-report. 


2. Jackson’s Personality Research Form (PRF) 
is also based upon Murray’s need system. The 20 
personality scales on the PRF possess no item over- 
lap and show exceptional internal consistency 
(median of .92). PRF validity is buttressed by con- 
firmatory factor analysis and appropriate correla- 
tions with similar scales on other instruments. 


3. The Myers-Briggs Type Indicator (MBTI) 
is a forced-choice self-report inventory based 
loosely upon Carl Jung’s theory of personality 
types. The MBTI is scored for four dimensions: Ex- 
traversion-Introversion, Sensing-iNtuition, Think- 
ing-Feeling, and Judging-Perceptive, yielding 16 
different types, such as ENFP. 


4. The Jenkins Activity Survey (JAS) is a 52-item
multiple-choice questionnaire designed to
identify the Type A coronary-prone behavior pattern.
The three subscales are Speed and Impatience,
Job Involvement, and Hard-Driving and
Competitiveness. The JAS has several limitations
(e.g., unrepresentative norms, scoring complexities)
and is therefore best suited to research.


5. A short, simple test that has received high
marks for technical merit is the State-Trait Anxiety
Inventory (STAI). The 40 items of the STAI are
each rated on a four-point intensity scale. The STAI
measures state anxiety, or transitory feelings of fear
or worry, and trait anxiety, the relatively stable tendency
to respond anxiously to stressful situations.


6. Cattell’s Sixteen Personality Factor Ques- 
tionnaire (16PF) is typical of factor-analytically 
derived instruments. The five forms of the 16PF 
(for different age groups) all encompass a forced- 
choice format. The 16 surveyed personality attrib- 
utes (and four higher-order dimensions) have been 
repeatedly confirmed by factor analysis. 


7. The Eysenck Personality Questionnaire 
(EPQ) proposes three major factor-analytically de- 
rived dimensions of personality: Psychoticism, Ex- 
traversion, and Neuroticism. Scale reliabilities are 
quite strong and the construct validity of the in- 
strument is supported by dozens of studies. 


8. The Comrey Personality Scales embody a 
short self-report instrument suitable for college stu- 
dents. The eight CPS scales consist of 20 items 
each and possess no overlap. The scales show ex- 
cellent internal consistency. Extreme scores are es- 
pecially predictive of psychological disturbance. 


9. The NEO Personality Inventory-Revised 
(NEO PI-R) is based upon the five-factor model of 
personality described earlier. The five constructs 





measured by the test are Neuroticism, Extraversion, 
Openness to Experience, Agreeableness, and Con- 
scientiousness. The NEO PI-R is available in two 
parallel forms consisting of 240 items rated on a 
five-point dimension. 


10. The MMPI-2 consists of 567 true-false 
questions. The test is scored for four validity scales 
(?, L, F, and K) that assess unanswered questions, 
naive defensiveness, deviant responses, and subtle 
defensiveness, respectively. The 10 clinical scales 
are Hypochondriasis, Depression, Hysteria, Psy- 
chopathic Deviate, Masculinity-Femininity, Para- 
noia, Psychasthenia, Schizophrenia, Hypomania, 
and Social Introversion. 

11. The California Psychological Inventory 
(CPI) is an MMPI-like instrument designed to mea- 
sure the dimensions of normal personality. Three 
scales measure test-taking attitudes (e.g., “fake 
good” and “fake bad” tendencies). The 17 clinical
scales are based upon “folk” concepts of personal- 
ity easily recognized by laypersons. 

12. The Millon Clinical Multiaxial Inventory, 
now in its third edition (MCMI-III), is a short test
(175 true-false items) designed as an aid to psychi- 
atric diagnosis. The 27 scales are organized into four 
broad categories relevant to DSM-IV: clinical per- 
sonality patterns, severe personality pathology, clin- 
ical syndromes, and severe clinical syndromes. 


13. Designed to provide clinically relevant de- 
scriptions of child behavior and family characteris- 
tics, the Personality Inventory for Children-2 
(PIC-2) consists of 275 true-false statements that 
are completed by a parent or parental surrogate. 
The test is suitable for children 5 through 19 years 
of age and yields scores on 9 adjustment scales and 
21 subscales. 


KEY TERMS AND CONCEPTS 


social desirability response set
ipsative test
state anxiety
trait anxiety
extraversion
introversion


TOPIC 14B Behavioral Assessment
and Related Approaches


Foundations of Behavior Therapy 

Behavior Therapy and Behavioral Assessment 
Assessment of Nonverbal Behavior 
Ecological Momentary Assessment 


Summary 
Key Terms and Concepts 


In this topic the reader will encounter a variety of
straightforward, innovative, and occasionally 
nontraditional approaches to personality evaluation 
collectively known as behavioral assessment. Be- 
havioral assessment concentrates on behavior it- 
self rather than on underlying traits, hypothetical 
causes, or presumed dimensions of personality. The 
many methods of behavioral assessment offer a 
practical alternative to projective tests, self-report 
inventories, and other unwieldy techniques aimed 
at global personality assessment. 

Typically, behavioral assessment is designed 
to meet the needs of therapists and their clients in 
a quick and uncomplicated manner. But behav- 
ioral assessment differs from traditional assess- 
ment in more than its simplicity. The basic 
assumptions, practical aspects, and essential goals 
of behavioral and traditional approaches are as 
different as night and day. Traditional assessment 
strategies tend to be complex, indirect, psychody- 
namic, and often extraneous to treatment. In con- 
trast, behavioral assessment strategies tend to be 
simple, direct, behavior-analytic, and continuous 
with treatment. 

Behavior therapists use a wide range of modal- 
ities to evaluate their clients, patients, and subjects. 
The methods of behavioral assessment include, but 
are not limited to, behavioral observations, self- 
reports, parent ratings, staff ratings, sibling ratings, 


judges’ ratings, teacher ratings, therapist ratings, 
nurses’ ratings, physiological assessment, bio- 
chemical assessment, biological assessment, struc- 
tured interviews, semistructured interviews, and 
analogue tests. In their Dictionary of Behavioral 
Assessment Techniques, Hersen and Bellack 
(1988) list 286 behavioral tests used in widely 
diverse problems and disorders in children, ado- 
lescents, adults, and the geriatric population. 
Dozens more are referenced in a more recent com- 
pendium (Hersen & Bellack, 1998). So that the 
reader can appreciate the diversity of techniques 
available, we provide a sampling of these tests in 
Table 14.9. 

In recent years, a new form of behavioral 
assessment known as ecological momentary as- 
sessment has become increasingly popular. In eco- 
logical momentary assessment, the client carries a 
wireless handheld device similar to a personal dig- 
ital assistant and responds in real-time to pre- 
planned inquiries from the researcher. This 
approach is designed to circumvent a number of 
limitations of traditional self-report techniques. We 
discuss ecological momentary assessment in more 
detail at the end of this chapter. 

Behavioral assessment is often—but not al- 
ways—an integral part of behavior therapy de- 
signed to change the duration, frequency, or 
intensity of a well-defined target behavior. For 
TABLE 14.9 A Sampling of Behavioral Assessment Tests and Techniques


Abnormal Involuntary Movement Scale 
Activities of Daily Living-Modular Assessment 
Agoraphobic Cognitions Questionnaire 
Alcohol Beliefs Scale 

Antidepressive Activity Questionnaire 
Assertion Situations 

Assertiveness Self-Statement Test 
Automated Matching Familiar Figures Test 
Behavior Profile Rating Scale 

Behavioral Assessment of Bruxism 
Behavioral Avoidance Slide Test 
Behavioral Measures of Severe Depression 
Behavioral Visual Acuity Test 

Binge Eating Questionnaire 

Blood Alcohol Level 

Blood Pressure Reactivity 

Body Sensations Questionnaire 

Client Resistance Coding System 

Clinical Dementia Rating 

Combat Exposure Scale 

Compulsive Activity Checklist 

Conflict Behavior Questionnaire 

Daily Sleep Diary 

Derogatis Sexual Functioning Inventory 
Dieter’s Inventory of Eating Temptations 
Dysfunctional Attitudes Scale 


Expired Air Carbon Monoxide Measurement 
Family Interaction Coding System 

Fire Emergency Behavioral Situations Scale 
Georgia Court Competency Test-Revised 

Goal Attainment Scaling 

Height Avoidance Test
Irrational Beliefs Inventory 

Leyton Obsessional Inventory 

McGill Pain Questionnaire 

Michigan Alcohol Screening Test 

Musical Performance Anxiety Self-Statement Scale 
Panic Attack Questionnaire 

Parent-Adolescent Interaction Coding System 
Penile Volume Responses 

Rape Aftermath Symptom Test 

Self-Control Rating Scale 

Self-Statement Assessment via Thought Listing 
Sensation Seeking Scale-Form VI 

Sexual Experience Scales 

Spouse Verbal Problem Checklist 

Standardized Walk 

Subjective Probability of Consequences Inventory 
Test Meals in the Assessment of Bulimia Nervosa 
Treatment Evaluation Inventory 

Type A Structured Interview 

Visual Analogue Scale 





Source: Based on entries in Hersen, M., & Bellack, A. S. (Eds.). (1988). Dictionary of behavioral assessment techniques. New York: Pergamon.


example, one therapy goal for a shy college stu- 
dent might be that she initiate a minimum of five 
conversations lasting two minutes or more each 
day. The therapist might recommend that she ap- 
proach this goal incrementally, beginning with a 
few brief social exchanges before proceeding to 
lengthier conversations with strangers. In this ex- 
ample, behavioral assessment might take the form 
of self-monitoring in which the student uses a 
wristwatch for timing and a diary for keeping track 
of conversations. 
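
The self-monitoring record in this example lends itself to a simple tally. The sketch below checks each day of a diary against the stated goal of at least five conversations lasting two minutes or more. The diary entries, day labels, and function name are hypothetical.

```python
def met_daily_goal(durations_min, min_conversations=5, min_minutes=2.0):
    """Return True if enough conversations reached the minimum length.

    durations_min: list of conversation lengths (in minutes) logged for one day
    """
    # Only conversations lasting at least min_minutes count toward the goal.
    qualifying = [d for d in durations_min if d >= min_minutes]
    return len(qualifying) >= min_conversations


# Hypothetical diary: conversation lengths in minutes, keyed by day.
diary = {
    "Mon": [0.5, 2.0, 3.5, 1.0, 2.5],
    "Tue": [2.0, 2.5, 4.0, 3.0, 2.0, 1.0],
}
for day, durations in diary.items():
    print(day, met_daily_goal(durations))  # prints "Mon False" then "Tue True"
```

Lowering `min_conversations` or `min_minutes` in the early weeks would mirror the incremental approach recommended by the therapist.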

As noted, behavioral assessment often exists in 
service of behavior therapy. In many cases, the na- 
ture of behavioral assessment is dictated by the pro- 
cedures and goals of behavior therapy. For this 
reason, the reader will better appreciate behavioral 
assessment tools if we interweave this topic with a 
discussion of behavior therapy methods. 


FOUNDATIONS OF 
BEHAVIOR THERAPY 


Behavior therapy, also called behavior modifica- 
tion, is the application of the methods and findings 
of experimental psychology to the modification of 
maladaptive behavior (Plaud & Eifert, 1998). The 
roots of behavior therapy can be traced to Skinner’s 
(1953) seminal book, Science and Human Behavior,
which detailed the application of operant conditioning
to the problems of human behavior.
Skinner shunned any reference to private, nonob- 
servable events such as thoughts or feelings; he em- 
phasized the importance of identifying observable 
behaviors and methodically altering the environ- 
mental consequences of those behaviors. 
Research by Wolpe (1958) on the systematic 
behavioral treatment of phobias also was influential 
in founding the methods of behavior therapy.
Wolpe’s clinical procedures were derived from his 
laboratory work on the conditioning and counter- 
conditioning of fear in cats. Like Skinner, Wolpe 
deemphasized the significance of thoughts and be- 
liefs. He viewed fear as a learned phenomenon that 
could be unlearned by following a strict protocol of 
graduated exposure to the feared object or situation. 

More recently, Bandura (1977), Mahoney and 
Arnkoff (1978), and Meichenbaum (1977) reintro- 
duced cognitive factors into the ever-changing be- 
havioral framework. For example, Bandura (1977) 
demonstrated that persons are perfectly capable of 
cognitively based learning. In particular, he showed 
that individuals can learn from mere observation of 
the response contingencies experienced by models. 
Since this learning occurs in the absence of per- 
sonal consequences, it must be cognitively medi- 
ated. As a consequence of this paradigm shift, 
practically all modern-day behavior therapists con- 
cern themselves—at least to some extent—with the 
thoughts and beliefs of their clients. This new em- 
phasis is reflected in a family of very popular treat- 
ment procedures known collectively as cognitive 
behavior therapy (McMullin, 1986). 






BEHAVIOR THERAPY AND
BEHAVIORAL ASSESSMENT

At present, the specific techniques of behavior ther- 
apy can be classified into five overlapping categories 
(Johnston, 1986): exposure-based methods, contin- 
gency management procedures, cognitive behavior 
therapies, self-control procedures, and social skills 
training. Behavioral assessment is used in all of these 
approaches, as reviewed in the following sections. 
However, there are relatively few behaviorally based 
tools for the evaluation of social skills, so this cate- 
gory is not discussed. Readers who desire limited 
coverage of instruments for the behavioral evalua- 
tion of social skills training (including assertiveness) 
should consult Meier and Hope (1998). 


Exposure-Based Methods 


Exposure-based methods of behavioral therapy are
well suited to the treatment of phobias, which include
intense and unreasonable fears (e.g., of spiders,
blood, public speaking). One approach to
phobic avoidance is systematic exposure of the 
client to the feared situation or object. Wolpe 
(1973) favored gradual exposure with minimal anx- 
iety in a procedure known as systematic desensiti- 
zation. In this therapeutic approach, the client first 
learns total relaxation and then proceeds from 
imagined exposure to actual or in vivo exposure to 
the feared stimulus. Another exposure-based 
method is flooding or implosion in which the client 
is immediately and totally immersed in the anxiety- 
inducing situation. 

The therapist needs some type of behavioral as- 
sessment to gauge the continuing progress of a 
client undergoing an exposure-based treatment for 
a phobia. In the simplest possible assessment ap- 
proach, known as a behavioral avoidance test 
(BAT), the therapist measures how long the client 
can tolerate the anxiety-inducing stimulus. Here is 
one classic example of a standardized BAT used to 
evaluate patients with agoraphobia, a disabling fear 
of open spaces often accompanied by panic attacks: 


The standardized Behavioral Avoidance Test (BAT) 
was conducted a week after intake. All anxiolytics, 
antidepressants, or other psychotropic medication 
had been taken away at least 4 days before the test. 
The test was administered by the first author, who 
was blind to the patients’ diagnoses [and] not involved
in the treatment. The patients were asked to
walk alone as far as they could from the hospital 
along a mildly trafficated road that was 2 km long. 
The route was divided into eight intervals of equal 
length, and the patients rated their anxiety level on 
a 0-10 scale at the end of each interval. Uncompleted
intervals were given a score of 10. An avoidance-anxiety
score was computed by summing the
anxiety scores for all intervals. (Hoffart, Friis, 
Strand, & Olsen, 1994) 
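
The scoring rule in this passage can be sketched in a few lines of Python. This is only an illustration of the arithmetic as quoted (eight intervals, ratings of 0 to 10, a score of 10 for every uncompleted interval); the function name and the input format are assumptions made here, not part of the published procedure.

```python
from typing import Sequence

N_INTERVALS = 8    # the route was divided into eight intervals of equal length
MAX_ANXIETY = 10   # score given to every uncompleted interval


def avoidance_anxiety_score(ratings: Sequence[int]) -> int:
    """Sum the 0-10 anxiety ratings over all eight intervals.

    `ratings` holds one rating for each interval the patient completed,
    in order. Intervals the patient never reached are scored MAX_ANXIETY.
    (Hypothetical helper; the input format is an assumption.)
    """
    if len(ratings) > N_INTERVALS:
        raise ValueError("at most eight intervals can be completed")
    if any(not 0 <= r <= MAX_ANXIETY for r in ratings):
        raise ValueError("anxiety ratings must lie between 0 and 10")
    uncompleted = N_INTERVALS - len(ratings)
    return sum(ratings) + uncompleted * MAX_ANXIETY
```

For instance, a hypothetical patient who completes three intervals with ratings of 2, 3, and 5 receives 10 + (5 x 10) = 60, whereas a patient who cannot begin the walk receives the maximum of 80. Avoidance and experienced anxiety are therefore folded into a single number.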


The researchers discovered that the avoidance- 
anxiety score from the BAT technique was strongly 
related to self-reports of catastrophic thoughts (e.g., 
choking to death, having a heart attack, acting fool- 
ish, becoming helpless). This finding illustrates that 
behavioral assessment approaches often encompass
a cognitive component as well. Notice, too, the
direct relationship between the goal of therapy and 
the behavioral avoidance test. In agoraphobia, the 
primary treatment goal is to reduce patients’ anxiety
about walking alone in open spaces, which is
exactly what the BAT measures. 

The BAT approach is predicated on the reason- 
able assumption that the client’s fear is the main de- 
terminant of behavior in the testing situation. 
Unfortunately, demand characteristics for desirable 
behavior may exert a strong influence on the 
client’s behavior. The client’s tolerance of the anx- 
iety-inducing stimulus will bear some relationship 
to experienced fear, but also has much to do with 
the situational context of assessment (McGlynn & 
Rose, 1998). The results of BAT assessments may 
not generalize, and the therapist must be wary of 
foreclosing treatment too soon. 

A fear survey schedule is another type of be- 
havioral assessment useful in the identification and 
quantification of fears. Fear survey schedules are 
face valid devices that require respondents to indi- 
cate the presence and intensity of their fears in re- 
lation to various stimuli, typically on a 5- or 7-point 
Likert scale. Dozens of these instruments have been 
published, including versions by Wolpe (1973), Ol- 
lendick (1983), and Cautela (1977). Tasto, Hick- 
son, and Rubin (1971) used factor analysis to 
develop a 40-item survey that yields a profile of 
fear scores in five categories. A generic fear survey
schedule is shown in Table 14.10. Fear survey
schedules are often used in research projects to
screen large samples of persons in search of subjects
who share a common fear. Another use of
these schedules is to monitor changes in fears,
including those that have been targeted for clinical
intervention.

Klieger and Franklin (1993) have raised a number
of cautions about the use of fear survey schedules
in clinical research. These authors note that
reliability data for fear surveys are almost nonexistent.
A more serious problem has to do with the validity
of these instruments. Using the Wolpe and
Lang (1977) Fear Survey Schedule-III (FSS-III), a
highly respected and widely used schedule, Klieger
and Franklin (1993) found no relationship between
reported fears on the FSS-III and BAT measures of
the same fears. For example, subjects who reported
a high fear of blood on the FSS-III were just as
likely to approach a bloody white towel and touch
it as were subjects who reported no fear of blood.
Similar results were found for subjects who feared
snakes, spiders, and fire. The researchers concluded
that the FSS-III and similar instruments are a poor
choice for identifying experimental groups and a
poor basis for measuring the outcome of therapeutic
interventions. The essential downfall seems to
be that fear survey schedules possess such “obvious”
validity that few researchers have bothered to
evaluate the traditional psychometric characteristics
of reliability and validity. Fear survey schedules
should be used with caution.


TABLE 14.10 Example of a Fear Survey Schedule

Please check the column that best describes your current response to these situations or objects.

                                  Degree to which you would be disturbed
                            Not at    Just a    Moderate    Very    Extremely
                            All       Little    Amount      Much    Bothered
Being in a strange place    ____      ____      ____        ____    ____
Speaking in public          ____      ____      ____        ____    ____
Walking into a party        ____      ____      ____        ____    ____
Getting an injection        ____      ____      ____        ____    ____
People watching me work     ____      ____      ____        ____    ____
Large open spaces           ____      ____      ____        ____    ____
Being fat                   ____      ____      ____        ____    ____
Spider on the wall          ____      ____      ____        ____    ____
Cat in the room             ____      ____      ____        ____    ____
Reprimand from the boss     ____      ____      ____        ____    ____

Note: Most fear survey schedules consist of several dozen items.


Contingency Management Procedures 


Contingency management is predicated on the as- 
sumption that all behavior—including disturbed or 
maladaptive behavior—is maintained by its conse- 
quences. A contingency management approach to 
behavior change proceeds in two steps. In step one, 
the therapist attempts to identify the positively re- 
inforcing consequences of the unwanted behavior. 
For example, in the case of an agoraphobic home- 
maker, the therapist might determine that she re- 
ceives attention from her family mainly when she 
is panicky and housebound. In this instance, the 
maladaptive behavior of housebound panic is being 
maintained—at least in part—by the untimely so- 
licitousness of family members. Step two consists 
of changing the contingencies for the unwanted be- 
havior. The therapist might recommend that family 
members not attend to the homemaker during her 
panicky episodes and that they do reinforce her 
small steps toward independence. For example, if 
the homemaker walks to the mailbox and back, the 
family members should shower her with attention.

One form of contingency management widely
used in institutional settings is the token economy. 
This approach is particularly well suited to clients 
with limited behavioral repertoires. In a token econ- 
omy, many different forms of prosocial behavior are 
rewarded with tokens that can be later exchanged 
for material rewards or privileges (Kazdin, 1988).

Behavioral assessment in a token economy
mainly takes the form of direct behavioral obser- 
vation (Foster, Bell-Dolan, & Burge, 1988). The 
application of a direct behavioral observation sys- 
tem is much more demanding than it sounds. The 
researcher or therapist must identify target behav- 
iors, define them precisely, produce a system for 
data recording, calibrate appropriate rewards, and 
train staff members (Paul, 1986). If the program 
works, the observed frequency of targeted prosocial
behaviors will increase gradually over a period
of weeks or months. 

Token economies were enormously popular in 
the 1970s when they were even used by entire 
school systems to facilitate the improvement of so- 
cial skills in disadvantaged elementary school- 
children (Bushell, 1978). Token economies are still 
popular, but the initial uncritical enthusiasm has 
been replaced by a less sanguine view. Kazdin 
(1988) discusses the limitations of token economies 
that include (1) poor generalization of the target be- 
haviors beyond the treatment program, (2) difficul- 
ties in training staff to properly implement token 
economies, and (3) client resistance to participa- 
tion. In addition, reactivity can be a problem in di- 
rect behavioral observation: Clients may not behave 
naturally when they know they are being observed. 


Cognitive Behavior Therapies 


The one factor common to all cognitive behavior 
therapies is an emphasis on changing the belief 
structure of the client. The three best-known vari- 
ants of cognitive behavior therapy are Ellis’s 
(1962) rational emotive therapy (RET), Meichen- 
baum’s (1977) self-instructional training, and 
Beck’s (1976) cognitive therapy. Ellis postulates 
that most disturbed behavior is caused by irrational 
beliefs, such as the widespread belief that one must 
have the love and approval of all significant per- 
sons at all times. Ellis attempts to alter such core 
irrational beliefs, primarily by logical argument 
and forceful exhortation. Meichenbaum’s self- 
instructional technique consists of teaching the 
client to use coping self-statements to combat 
stressful situations. For example, a college student 
suffering from intense test-taking anxiety might be 
taught to use the following self-talk during examinations:
“You have a strategy this time. . . . Take
a deep breath and relax. . . . Just answer one
question at a time. . . .” Beck’s cognitive therapy
concentrates mainly on the role of cognitive 
distortions in the maintenance of depression and 
other emotional disturbances. Beck (1983) regards 
depression as primarily a cognitive disorder 
characterized by the negative cognitive triad: a
pessimistic view of the world, a pessimistic self-concept,
and a pessimistic view of the future. In
therapy, he uses a gentle form of cognitive re- 
structuring to help the client perceive his or her 
problems in alternative, solvable terms. 

Cognitive behavior therapists need not use for- 
mal assessment tools in their clinical practice. Typ- 
ically, these therapists monitor the belief structure 
of their clients on an informal session-to-session 
basis. Irrational and distorted thoughts are chal- 
lenged as they arise during therapy. In the end, the 
client’s self-report of improvement may constitute 
the main index of therapeutic success. Nonetheless, 
several straightforward measures of cognitive dis- 
tortion are available. We have outlined a few promi- 
nent instruments in Table 14.11. Other examples 
can be found in Clark (1988) and Haynes (1998). 
These instruments are mainly research question- 
naires suitable to the testing of group differences, 
but not sufficiently validated for individual assess- 
ment. Clark (1988) faults the developers of cogni- 
tive distortion questionnaires for premature release 
of their instruments. In particular, he notes the absence
of research on the concurrent and discriminant
validity of most self-statement measures.
Another problem is that existing questionnaires 
were designed to validate constructs in research and 
consequently do not work well in clinical practice. 

An exceptional and well-validated measure not 
listed in Table 14.11 is the Beck Depression Inven- 
tory (BDI). The BDI is a short, simple, self-report 
questionnaire that focuses, in part, on the cognitive 
distortions that underlie depression (Beck & Steer, 
1987; Beck, Ward, Mendelson, Mock, & Erbaugh,
1961). One reason for its popularity is that most pa- 
tients can complete the 21 items on the BDI in 10 
minutes or less. The test has been widely used: 
More than 1,900 articles using the BDI have been 
published (Conoley, 1992). A second edition of the 
inventory was released in 1996 (Beck, Steer, & 
Brown, 1996). On the BDI-II, several items were 
revised so as to bring the inventory into closer con- 
formity with prevailing diagnostic criteria for de- 
pression. The 21 items are of the following form: 

Check the statement from this group that you 
feel is most true about you: 


0 I am upbeat about the future.
1 I feel slightly discouraged about the future.
2 I feel the future has little to offer for me.
3 I feel that the future is utterly hopeless.


Thirteen items cover cognitive and affective com- 
ponents of depression such as pessimism, guilt, 
crying, indecision, and self-accusations; eight 
items assess somatic and performance variables 
such as sleep problems, body image, work difficulties,
and loss of interest in sex. The examinee receives
a score of 0 to 3 for each item; the total raw
score is the sum of the endorsements for the 21
items; the highest possible score is 63.

In a meta-analysis of BDI research studies, the 
internal consistency of the scale (coefficient alpha) 
ranged from .73 to .95, with a mean of .86 in nine 
psychiatric populations (Beck, Steer, & Garbin, 
1988). The BDI-II possesses excellent internal con- 
sistency with a coefficient alpha of .92 (Beck, Steer, 
& Brown, 1996). Test-retest reliability of the BDI is 
modest, with a range of .60 to .83 in nonpsychiatric 
samples and .48 to .86 in psychiatric samples. How- 
ever, the test-retest methodology is not well suited 
to phenomena such as depression that are naturally 
unstable. Subjective depression fluctuates dramati- 
cally from week to week, day to day, even hour to 
hour. A lackluster value for test-retest reliability 
might signify valid change in the construct being 
measured rather than unwanted measurement error. 

A variety of normative results are available, 
with BDI data for samples of patients with major 
depression, dysthymia, alcoholism, heroin addic- 
tion, and mixed problems. The manual also pro- 
vides guidelines for degree of depression based 
upon BDI score (0 to 9, normal; 10 to 19, mild to 
moderate; 20 to 29, moderate to severe; 30 and 
above, extremely severe). These ratings are based 
upon clinical evaluations of patients. 
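
The raw-score arithmetic and the score ranges just listed can be sketched as follows. This is a minimal illustration, assuming the 21 endorsements are available as a list of integers; the function names are hypothetical, and the cutoffs are simply those quoted above, not a substitute for the test manual.

```python
from typing import Sequence

BDI_ITEM_COUNT = 21  # each item is scored 0 to 3, so totals run from 0 to 63


def bdi_total(responses: Sequence[int]) -> int:
    """Return the total raw score: the sum of the 21 item endorsements."""
    if len(responses) != BDI_ITEM_COUNT:
        raise ValueError("the BDI has exactly 21 items")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each item is scored 0, 1, 2, or 3")
    return sum(responses)


def bdi_severity(total: int) -> str:
    """Map a total raw score onto the ranges quoted in the text."""
    if not 0 <= total <= 63:
        raise ValueError("total raw score must lie between 0 and 63")
    if total <= 9:
        return "normal"
    if total <= 19:
        return "mild to moderate"
    if total <= 29:
        return "moderate to severe"
    return "extremely severe"
```

A hypothetical examinee who endorses the "1" statement on every item would obtain a total of 21, which falls in the moderate to severe range.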

The BDI has been extensively validated against 
other measures of depression and independent cri- 
teria of depression. For example, correlations with 
clinical ratings and scales of depression such as 
from the MMPI are typically in the range of .60 to 
.76 (Conoley, 1992). Sex differences are minimal, 
although there may be slight differences in the
expression of depression between men and women
(Steer, Beck, & Brown, 1989).


TABLE 14.11 Questionnaire Measures of Cognitive Distortion


Anxious Self-Statements Questionnaire (ASSQ)
(Kendall & Hollon, 1989)
Examinee rates how often specific anxious thoughts occurred over the last week. Items
are of the form:

I can’t stand it anymore.
What’s going to happen to me now?
I’m not going to make it.

A psychometrically sound instrument, the ASSQ can be used to assess changes in the
frequency of anxious self-talk.


Automatic Thoughts Questionnaire (ATQ)
(Hollon & Kendall, 1980; Kazdin, 1990)
The ATQ is a frequency measure of depression-related cognitions that assesses personal
maladjustment, negative self-concept and expectations, low self-esteem, and giving
up/helplessness. The 30-item ATQ correlates very well with the MMPI Depression
scale and the Beck Depression Inventory (Ross, Gottfredson, Christensen, & Weaver,
1986).


Cognitive Errors Questionnaire (CEQ)
(Lefebvre, 1981)
The CEQ assesses the degree of maladaptive thinking in general situations and also
situations related to chronic low back pain. Discrete vignettes concerning chronic back
pain and general scenes are each followed by an illogical dysphoric cognition. The
respondent indicates on a 5-point scale how similar the cognition is to the thought he or she
would have in the same situation. For example: “You just finished spending three hours
cleaning the basement. Your spouse, however, doesn’t say anything about it. You think to
yourself, ‘S(he) must think I did a poor job.’” Smith, Follick, Ahern, and Adams (1986)
found that overgeneralization was the specific CEQ cognitive error most consistently
correlated with chronic low back pain disability.


Attribution Styles Questionnaire (ASQ)
(Seligman, Abramson, Semmel, & Von Baeyer, 1979)
The ASQ measures three attributional dimensions relevant to Seligman’s learned
helplessness model of depression: internal-external, stable-unstable, and global-specific.
Depressed persons attribute bad outcomes to internal, stable, and global causes; they
attribute good outcomes to external, unstable causes. The questionnaire consists of 12
hypothetical situations, 6 describing good outcomes, 6 describing bad outcomes (e.g.,
“You have been looking for a job unsuccessfully for some time”). The respondents rate
each vignette on a 7-point scale for degree of internality, stability, and globality.


Hopelessness Scale (HS)
(Beck, 1987; Dyce, 1996)
A 20-item true/false scale, the HS is designed to quantify hopelessness, one component
of the negative cognitive triad found in depressed persons. (The triad consists of negative
views of self, world, and future.) The scale is sensitive to changes in the patient’s state
of depression. In a validational study, Beck, Riskind, Brown, and Steer (1988) found that
HS scores had a negligible relationship to anxiety or general psychopathology when the
influence of coexisting depression was partialed out. Thus, the HS appears to measure a
specific attribute of depression rather than general psychopathology.

The only shortcoming of the BDI is its trans- 
parency. Patients who wish to hide their despair or 
exaggerate their depression can do so easily. How- 
ever, for patients who are motivated to accurately 
reflect their emotional status, the BDI and BDI-II 
are probably unbeatable as an index of the presence 
and degree of depression (Stehouwer, 1987). Some 
practitioners ask patients to complete the BDI after 
each therapy session; they use the BDI much as a 
physician might use a thermometer. 


Self-Monitoring Procedures 


A common misconception about behavior therapy is 
that it consists of authoritarian therapists applying 
powerful rewards and punishments to passive cli- 
ents. Although this stereotypical model may be true 
for some impaired clients with limited behavioral 
repertoires, for the most part behavior therapy con- 
sists of humane practitioners teaching their clients 
methods of self-control. An emphasis upon self- 
monitoring is fundamental to all forms of behavior 
therapy. In self-monitoring, the client chooses the 
goals and actively participates in supervising, chart- 
ing, and recording progress toward the end point(s) 
of therapy. According to this model, the therapist is 
relegated to the status of expert consultant. 

Self-monitoring procedures are especially use- 
ful in the treatment of depression, a prevalent 
behavior disorder consisting of sad mood, low ac- 
tivity level, feelings of worthlessness, concentra- 
tion problems, and physical symptoms (sleep loss, 
appetite disturbance, reduced interest in sex). Sev- 
eral self-monitoring programs for depression have 
been reported (Lewinsohn & Talkington, 1979; 
Rehm, 1984; Rehm, Kornblith, O’Hara, & others,
1981). In order to illustrate the self-monitoring ap- 
proach to the control of depression, we will sum- 
marize one small corner of the program advocated 
by Lewinsohn and his colleagues (Lewinsohn, 
Munoz, Youngren, & Zeiss, 1986). 

Lewinsohn observed that depression goes hand 
in hand with a marked reduction in the experiencing
of pleasant events. Depressed persons retreat
from engaging in pleasant activities; the behavioral
withdrawal only contributes further to their depres- 
sion, inciting a continuous downward spiral. Fortu- 
nately, it is possible to replace the downward spiral 
with an upward one. To help reverse the downward 
spiral of depression, Lewinsohn and his colleagues 
devised the Pleasant Events Schedule (PES; Mac- 
Phillamy & Lewinsohn, 1982). The purpose of the 
PES is twofold. First, in the baseline assessment 
phase, the PES is used to self-monitor the frequency 
(F) and pleasantness (P) of 320 largely ordinary, 
everyday events. Examples of the kinds of events 
listed on the PES include the following: 


reading magazines
going for a walk
being with pets
playing a musical instrument
making food for charity
listening to the radio
reading poetry
attending a church service
watching a sports event
playing catch with a friend
working on my job


The frequency and pleasantness of these everyday 
events are both rated 0 to 2.¹ The mean rate of
pleasant activities is then calculated from the sum
of the F × P scores; that is, mean rate = Σ(F × P)/320.
Normative findings for mean F, mean P, and mean
F × P are reported in Lewinsohn, Munoz, Youngren,
and Zeiss (1986) and serve as a basis for treatment
planning. Participants in the Lewinsohn
program also monitor their daily mood on a simple 
1 (worst) to 9 (best) basis. 


1. The Frequency Scale is calibrated as follows: 


0—This has not happened in the past 30 days. 

1—This has happened a few times (1 to 6 times) in the past 30 
days. 

2—This has happened often (7 times or more) in the past 30 
days. 


The Pleasantness Scale is calibrated as follows: 


0—This was not pleasant. 
1—This was somewhat pleasant. 
2—This was very pleasant. 
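
The mean rate of pleasant activities described in the text can be sketched in Python as follows. The function name and the list-of-pairs input are assumptions made for illustration; the only elements taken from the text are the 0-to-2 rating scales, the cross-product of frequency and pleasantness, and the divisor of 320 events.

```python
from typing import Iterable, Tuple

PES_ITEM_COUNT = 320  # number of events on the full schedule, per the text


def pes_mean_rate(ratings: Iterable[Tuple[int, int]],
                  n_events: int = PES_ITEM_COUNT) -> float:
    """Return sum(F * P) / n_events for (frequency, pleasantness) pairs.

    Each pair holds the 0-2 frequency rating and the 0-2 pleasantness
    rating for one event. `n_events` defaults to the 320 events of the
    full schedule; a smaller value is useful only for toy examples.
    """
    if n_events <= 0:
        raise ValueError("n_events must be positive")
    total = 0
    for frequency, pleasantness in ratings:
        if frequency not in (0, 1, 2) or pleasantness not in (0, 1, 2):
            raise ValueError("F and P must each be rated 0, 1, or 2")
        total += frequency * pleasantness
    return total / n_events
```

With four hypothetical events rated (2, 2), (1, 0), (0, 2), and (1, 1), the cross-products are 4, 0, 0, and 1, so `pes_mean_rate(pairs, n_events=4)` gives 5/4 = 1.25. An event contributes to the rate only when it both occurred and was experienced as pleasant.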

The second use of the PES is to self-monitor
therapeutic progress. Based on the initial PES re- 
sults, clients identify 100 or so potentially pleasant 
events and strive to increase the frequency of these 
events, monitoring daily mood along the way. 
Clients who increase the frequency of pleasant
events generally show an improvement in mood 
and other depressive symptoms. 

The PES is a highly useful tool for clinicians who 
wish to implement a self-monitoring approach to the 
assessment and treatment of depression. MacPhil- 
lamy and Lewinsohn (1982) report favorably on the 
technical qualities of the PES and discuss a variety of 
rational, factorial, and empirical subscales, which we 
cannot review here. The instrument has fair to good 
test-retest reliability (one-month correlations in the 
range of .69 to .86), excellent concurrent validity with 
trained observers, and promising construct validity. 
In general, the subscales behave as one would predict 
on the basis of the constructs they purport to mea- 
sure—we refer the reader to MacPhillamy and 
Lewinsohn (1982) for details. 

Self-monitoring approaches to behavioral as- 
sessment have applications in many subspecialties 
of psychology and the health professions. We will 
briefly outline one other assessment device to illus- 
trate the diversity of methods available. Schlundt 
and Bell (1993) developed a microcomputer-based 
Body Image Testing System (BITS) to assess body 
image in persons with eating disorders. The eating 
disorder known as anorexia nervosa is character- 
ized by refusal to maintain normal body weight, an 
intense fear of gaining weight, and a significant
disturbance in the perception of the shape or size 
of the body. The majority of patients with anorexia 
nervosa are young females—the average age of 
onset is 17 years. A person with this disorder per- 
ceives her body to be much fatter than is actually 
true. But how can this misperception be measured 
and quantified? The BITS program is one approach 
to self-monitoring of distortions in body image. 
Briefly, the subject uses a menu to change a com- 
puter-generated body image until it resembles her 
self-perception. 

Instructions for the BITS program specify that 
the subject is to adjust the individual body parts until
the image resembles her actual body dimensions. In
a second trial, the subject generates ideal body di- 
mensions. The discrepancy between actual and ideal 
dimensions represents the subject’s degree of satis- 
faction with her body. In addition, the BITS pro- 
gram yields an index of perceptual distortion based 
upon a regression analysis of height and weight ver- 
sus the self-generated (actual) body image. 
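
The actual-versus-ideal comparison can be illustrated with a short Python sketch. The text does not give the BITS scoring formulas, so everything below is an assumption made for illustration: the body-part names, the units, and the use of a mean absolute difference as a single summary figure. The regression-based index of perceptual distortion is not attempted here.

```python
from typing import Dict


def body_discrepancies(actual: Dict[str, float],
                       ideal: Dict[str, float]) -> Dict[str, float]:
    """Return actual minus ideal for each body part (hypothetical units)."""
    if not actual or actual.keys() != ideal.keys():
        raise ValueError("both trials must rate the same, non-empty set of parts")
    return {part: actual[part] - ideal[part] for part in actual}


def mean_absolute_discrepancy(actual: Dict[str, float],
                              ideal: Dict[str, float]) -> float:
    """Summarize dissatisfaction as the average size of the discrepancies.

    This summary statistic is an illustrative choice, not the published
    BITS index.
    """
    diffs = body_discrepancies(actual, ideal)
    return sum(abs(d) for d in diffs.values()) / len(diffs)
```

A positive difference for a body part means that the subject set the self-perceived image larger than the ideal image for that part; a larger mean absolute discrepancy would correspond to lower satisfaction.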

Using data from 94 undergraduate females
retested after two to four weeks, BITS measures
were found to have acceptable reliability. For
example, the overall satisfaction rating for the nine 
body parts showed a test-retest reliability of .80. 
Normative data for 528 subjects drawn mainly 
from undergraduate samples are also available. The 
construct validity of the procedure was found to be 
very promising, based upon three sources of infor- 
mation: strong correlations between BITS scores 
and actual body size, strong correlations between 
BITS variables and other measures of body image, 
and the ability of the measure to predict distur- 
bances in eating behavior such as dieting, binge 
eating, and emotional eating. 

The BITS procedure is an excellent example of 
a new generation of psychological tests made pos- 
sible by recent developments in microcomputer 
technology. Tests that make creative use of com- 
puter graphics for stimulus and/or response display 
will become commonplace in the future. Additional 
approaches to computer-assisted psychological as- 
sessment are discussed in Topic 15A. 


ASSESSMENT OF NONVERBAL BEHAVIOR


Nonverbal behavior includes the subtler forms of 
human communication contained in glance, ges- 
ture, body language, tone of voice, and facial ex- 
pression. Although nonverbal communication 
plays a crucial role in human behavior, our knowl- 
edge of it is very incomplete. In part, this ignorance 
reflects the strong verbal orientation of our society; 
we equate communication with the successful use 
of words. But there are other, more subtle reasons 
for the lack of scientific knowledge about nonver- 
bal communication: 

Different types of nonverbal communication are so
embedded in our daily lives that we use the nonver- 
bal messages without being aware of them. When we 
form an opinion of what someone is like, for exam- 
ple, the opinion is probably based in part upon a 
complex analysis of nonverbal information. When 
we conclude that someone we have just met is angry 
or jealous or anxious to leave, we may have reached 
this conclusion as much by listening to the person’s 
tone of voice and by observing how agitated the per- 
son’s movements were, or by forming an impression 
of the warmth of his or her facial expression, as by 
interpreting what was actually said. (Rosenthal, 
Hall, DiMatteo, Rogers, & Archer, 1979) 


In this section we will investigate the scientific 
study of nonverbal communication, focusing on 
tools and methods that can be used for purposes of 
assessment. We begin by reviewing methods for 
discerning the nature of underlying personality 
from such external behavioral signs as visual inter- 
action, paralinguistic cues (e.g., tone of voice), and 
facial expression. The chapter closes with an analy- 
sis of the Profile of Nonverbal Sensitivity (PONS), 
a fascinating film or video test for assessing per- 
sonal sensitivity to the nonverbal communication 
of others. 


Visual Interaction 


Visual interaction has long been recognized as a 
key to other aspects of personality, although it is 
only in the last few decades that systematic re- 
search on this topic has appeared. Nielsen (1962) 
carried out the first empirical investigation of gaze 
in social behavior. His study was a purely observa- 
tional analysis of social confrontation and visual 
interaction. Shortly thereafter, other researchers 
embarked on programmatic studies, using standard 
experimental designs, with visual contact as the de- 
pendent variable, and later as the independent vari- 
able (summarized in Argyle & Cook, 1976; Fehr & 
Exline, 1987; Exline & Fehr, 1982). In large mea- 
sure, this research has sought to determine the 
implications of individual differences in such vari- 
ables as initiation of, maintenance of, and comfort 
with individual and mutual eye contact. A few 
fragile generalizations have emerged from this research,
which we summarize later. More noteworthy,
however, is a dawning consensus as to the difficulty
of obtaining meaningful measurements
about visual interaction. Let us briefly review the 
findings and then summarize the obstacles to ser- 
viceable research on this important topic. 

Visual interaction is best viewed as a social 
behavior that acts as a powerful signal from one 
person to another. The meaning of the signal is 
determined in many complex ways, so the inter- 
pretation of another person’s gaze pattern is very 
difficult. Nonetheless, several consistent findings 
do emerge from research on this topic (Argyle & 
Cook, 1976; Fehr & Exline, 1987): 


1. Gaze often acts as a signal for liking, especially
from dependent persons. 

2. Mutual gaze, especially if prolonged, may sig- 
nify a special kind of intimacy. 

3. Extraverts and self-confident persons establish 
visual contact more often and for longer peri- 
ods of time. 

4. Aversion of gaze increases when persons are 
close together or discussing intimate topics. 

5. Dishonest persons reduce visual interaction 
during episodes of deception. 

6. Females look more often, even in infancy. 

7. The amount of gaze is high during childhood, 
falls during adolescence, and then rises again. 

8. Affiliative persons gaze more in cooperative 
situations, but less in competitive situations. 

9. Lower-status persons gaze more than higher- 
status persons. 

10. Marked cultural differences exist; Arabs and 
Latin Americans gaze more, certain American 
Indians gaze less. 

11. Autistic, schizophrenic, and depressive per- 
sons tend to avoid looking at other persons. 


It should be obvious from the preceding list that the 
interpretation of gaze depends on numerous vari- 
ables and is also influenced by the interaction be- 
tween variables (e.g., affiliation and competition). 

The practitioner who wishes to interpret indi- 
vidual differences in visual interaction must first 
implement a data-gathering paradigm of some sort. 
Exline and Fehr (1982) have drawn attention to the 
difficulties involved in the assessment of gaze and
mutual gaze. These difficulties include the reactiv- 
ity of measurements made in laboratory settings 
and problems of reliability of measurements. The 
pragmatic problems of assessing and interpreting 
visual interaction are substantial. In fact, the obsta- 
cles most likely preclude the development of stan- 
dardized tests for the measurement of gaze and 
mutual gaze. Nonetheless, knowledge of the find- 
ings in this area can be used in conjunction with 
other nonverbal cues in practical clinical applica- 
tions such as the detection of deception (Ekman & 
Friesen, 1975). 


Paralinguistics 


Paralinguistics refers to tone of voice, rate of 
speaking, and other nonverbal aspects of speech. 
Often, paralinguistic cues are more powerful than 
the overt spoken message, as when we declare, “It 
wasn’t what he said; it was the way he said it.” An 
essential maxim of paralinguistic research is that 
the content of speech must be separated from its af- 
fective nuances. 

An important area for paralinguistic assessment 
has been emotion judged from “content-filtered” 
speech. In content-filtered speech, a person’s voice 
is recorded (or a tape is rerecorded) through a low- 
pass filter to remove high-frequency sounds and 
thereby render the words themselves unrecogniz- 
able. Content-filtered speech maintains the essen- 
tial paralinguistic aspects of speech (pitch variation 
and contour, tempo, volume, tonality, and rhythm) 
but eliminates the potential distraction of the con- 
tent. Trained judges can then rate the content-fil- 
tered speech for affective components such as 
anger or anxiety. 
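
The idea behind content filtering can be sketched digitally. The studies discussed here worked with tape recordings, so the following is only an analogy: a simple one-pole low-pass filter written in Python, with a cutoff of 400 Hz and a sampling rate of 16,000 Hz chosen purely as assumptions for the demonstration.

```python
import math
from typing import List, Sequence


def low_pass(samples: Sequence[float], sample_rate: float,
             cutoff_hz: float) -> List[float]:
    """Apply a one-pole (RC-style) low-pass filter to a sampled signal."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)          # smoothing factor between 0 and 1
    filtered = []
    previous = 0.0
    for x in samples:
        # Each output moves only part of the way toward the new input,
        # so rapid (high-frequency) changes are smoothed away.
        previous = previous + alpha * (x - previous)
        filtered.append(previous)
    return filtered


def sine(freq_hz: float, sample_rate: float, n: int) -> List[float]:
    """Generate n samples of a unit-amplitude sine tone."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level, a rough measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


SAMPLE_RATE = 16000.0
low_tone = rms(low_pass(sine(200.0, SAMPLE_RATE, 16000), SAMPLE_RATE, 400.0))
high_tone = rms(low_pass(sine(3000.0, SAMPLE_RATE, 16000), SAMPLE_RATE, 400.0))
```

After filtering, the 200 Hz tone retains most of its level while the 3,000 Hz tone is strongly attenuated. Applied to speech, the same principle preserves the slow-moving pitch contour and rhythm while removing much of the high-frequency detail needed to recognize words.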

Milmoe, Rosenthal, Blane, Chafetz, and Wolf 
(1967) conducted a pathbreaking study that illus- 
trates the practical application of content-filtered 
speech. They recorded the voices of resident phy- 
sicians who had completed a tour of duty in an 
alcoholism clinic. In particular, the researchers 
excerpted the doctors’ replies to the interview ques- 
tion, “What has been your experience with alco- 
holics?” These replies were content-filtered, and 


then rated along a six-point scale (1 = none, 6 = a 
lot) by a panel of judges for anger, sympathy, anx- 
iety, and matter-of-factness.? The interrater relia- 
bility of the emotional dimensions varied quite 
markedly and was also affected by the sex of the 
judge. For example, the agreement between male 
and female judges was good for anxiety (r = .84) 
but nonexistent for matter-of-factness (r = -.07).
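The agreement figures above are Pearson correlations between the ratings given by two groups of judges. As a rough sketch of that computation, one might write something like the following. The ratings are invented for illustration and are not data from Milmoe et al.; the function name `pearson_r` is our own.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length rating lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of ratings is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical mean anxiety ratings (1 = none, 6 = a lot) for five doctors.
male_judges = [2.0, 3.5, 4.0, 5.5, 1.5]
female_judges = [2.5, 3.0, 4.5, 5.0, 2.0]
print(round(pearson_r(male_judges, female_judges), 2))
```

A value near 1 would indicate that the two groups of judges rank the doctors in nearly the same order; a value near 0, as reported for matter-of-factness, would indicate essentially no agreement.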

The researchers also collected post hoc infor- 
mation on the doctors’ success in getting alcoholic 
patients to seek treatment. The paralinguistic data 
(ratings of anger, sympathy, anxiety, and matter- 
of-factness) were then correlated with objective 
data on referral effectiveness. Judges’ ratings of
anger for the content-filtered speech correlated
strongly and negatively with effectiveness
in referring alcoholic patients (r = -.67, p = .06).
Other variables were unrelated to effectiveness. In 
sum, doctors with high paralinguistic ratings for 
anger were ineffective in convincing alcoholic pa- 
tients to seek treatment. The same research team 
has shown that ratings of the mother’s voice (con- 
tent-filtered) are correlated with aspects of baby’s 
behavior such as irritability, insecurity, and atten- 
tiveness (Milmoe, Novey, Kagan, & Rosenthal, 
1974). 


Facial Expression 


Nonverbal communication is often mediated 
through facial expression. Indeed, the facial-ex- 
pressive apparatus is in large measure dedicated to 
facial displays of emotion. Additionally, the face 
helps regulate several nonemotional functions such 
as speech production, eating, respiration, smell, 
and protection of the eyes. 

The crucial role of the face in communication 
via the display of emotions has prompted re- 
searchers to develop objective methods for observ- 
ing and quantifying facial action. Two major 
methods for this purpose have emerged: (1) measurement
of visible facial actions using facial coding
systems, and (2) the measurement of electrical discharges
from the contraction of facial muscles.
Ekman (1982) and Fridlund, Ekman, and Oster
(1987) review both developments in detail. We will
highlight the best-known facial-coding system here.


2. Two other dimensions were rated from the content of the
tapes: sophistication and psychological-mindedness. We do not
discuss these findings here.

The Facial Action Coding System (FACS; 
Ekman & Friesen, 1978) was developed as a gen- 
eral-purpose evaluation tool suitable to a wide 
range of research and assessment purposes. Based 
upon an elaborate electrophysiological analysis of 
the precise role of each facial muscle in visible fa- 
cial expression, Ekman and Friesen derived 44 ac- 
tion units (AUs) that can, singly or in combination, 
account for all visible facial movement. All of the 
AUs are scoreable on a five-point intensity scale. 
For example, AU 1 concerns brow raising, which is 
controlled by one large muscle in the forehead area. 
Ekman and Friesen (1978) provide detailed in- 
structions for rating each AU from video record- 
ings. FACS also allows for coding onset, apex, and 
offset time of each AU. 
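To make the structure of a FACS rating concrete, the record for one coded action could be sketched as a small data structure. This is only an illustration under our own assumptions: the class and field names are hypothetical, and the five-point intensity scale is coded here as the integers 1 to 5, which may differ from the notation used in the FACS manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionUnitEvent:
    """One coded occurrence of a facial action unit (field names are illustrative)."""
    au: int          # action unit number, e.g. 1 for brow raising
    intensity: int   # five-point intensity scale, coded here as 1 (lowest) to 5 (highest)
    onset: float     # seconds from the start of the recording
    apex: float      # time of maximum intensity
    offset: float    # time at which the action ends

    def __post_init__(self):
        if self.au < 1:
            raise ValueError("action unit numbers start at 1")
        if not 1 <= self.intensity <= 5:
            raise ValueError("intensity must be on the five-point scale (1 to 5)")
        if not self.onset <= self.apex <= self.offset:
            raise ValueError("expected onset <= apex <= offset")

    @property
    def duration(self):
        """Total time the action is visible, in seconds."""
        return self.offset - self.onset


# A hypothetical brow raise (AU 1) of moderate intensity.
brow_raise = ActionUnitEvent(au=1, intensity=3, onset=0.5, apex=1.0, offset=2.0)
print(brow_raise.duration)  # 1.5
```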

FACS is difficult to learn and use because it re- 
quires repeated, slow-motion viewing of facial 
actions. However, once the system is mastered, 
competent judges produce highly reliable ratings of 
facial actions. Furthermore, the ratings predict 
emotional states with considerable accuracy. In one 
important validity study, Ekman, Friesen, and An- 
coli (1980) recorded the facial action of persons 
viewing both pleasant and unpleasant films. The 
FACS accurately predicted the subjects’ retrospec- 
tive reports of emotional experience (happiness, 
negative feelings, disgust). At least in some con- 
texts, particular facial actions do signal particular 
emotions. What needs to be sorted out in future re- 
search is the generality of the link between coded 
facial actions from FACS and the experience of pre- 
dicted emotions. 


Profile of Nonverbal Sensitivity (PONS) 


The PONS was developed in the 1970s by Robert
Rosenthal and colleagues to study the ability to
comprehend nonverbal cues transmitted by facial
expressions, body movements, and tone of voice
(Rosenthal, Hall, DiMatteo, Rogers, & Archer,
1979). One impetus for inventing this new assessment
instrument was Rosenthal’s earlier discovery
of the experimenter expectancy effect (Rosenthal, 
1967). Research on the experimenter expectancy 
effect showed that the experimenter can uninten- 
tionally influence subjects to change their behavior 
in the direction of the experimenter’s expectations, 
thereby creating self-fulfilling prophecies in the re- 
search laboratory and the classroom. Rosenthal rea- 
soned that the experimenter expectancy effect must 
be facilitated by the unwitting communication of 
nonverbal information from the experimenter to the 
subject. The PONS was developed, in part, to study 
the manner in which these nonverbal communica- 
tions might be decoded. 

The stimuli for the PONS are administered by 
means of film or videotape and consist of 220 two-second
segments of one female’s nonverbal behavior.³
For each stimulus item, the examinee must
choose which affective or emotional situation is 
being portrayed; two alternatives are listed on the 
answer sheet. For example, one visual item shows 
the portrayer’s face for two seconds as she depicts 
an emotional scene. The test taker is asked to 
choose between two descriptions of what the per- 
son is doing: (a) nagging a child, or (b) expressing 
jealous anger. The audio has been “cut” for this 
item so the examinee must decode the information 
purely from facial expression. Since the portrayer 
is on camera for only two seconds, there is little 
chance that the subject can use lip reading to obtain 
useful verbal information. 

The PONS assesses the examinee’s ability to decode
nonverbal behavior from 11 different channels.
These include three that are “pure” visual channels, 
two different auditory (paralanguage) channels, and 
six combined (visual plus auditory) channels. The 
visual channels include the following: 


1. The face
2. The body from the neck to the knees
3. The entire figure


The auditory channels include the following:


4. The randomized-spliced (RS) voice of the speaker
5. The content-filtered (CF) voice of the speaker


The combined channels include the following:


6. The face plus RS voice
7. The body plus RS voice
8. The figure plus RS voice
9. The face plus CF voice
10. The body plus CF voice
11. The figure plus CF voice


3. The PONS comes in five versions, including audio only,
video only, brief PONS, and nonverbal discrepancy test. We discuss
only the full PONS here.


The portrayer in the PONS is shown expressing 
20 different affective or emotional situations, rang- 
ing from relatively subtle emotions (e.g., “express- 
ing motherly love”) to more dramatic situations 
(e.g., “threatening someone”). Since each of the 20 
scenes appears in each of the 11 channels, the test 
consists of a total of 220 scenes. The PONS takes 
about 45 minutes to complete. 
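The arithmetic behind the 11 channels and 220 items can be checked with a short sketch. The variable names are our own; the only facts taken from the text are the three visual sources, the two voice treatments, and the 20 scenes. Crossing four video conditions (none, face, body, figure) with three audio conditions (none, RS, CF) and dropping the empty combination gives the 11 channels.

```python
from itertools import product

VIDEO = [None, "face", "body", "figure"]
AUDIO = [None, "RS", "CF"]  # RS = randomized-spliced, CF = content-filtered

# Every video/audio pairing except the one with neither picture nor sound.
channels = [(v, a) for v, a in product(VIDEO, AUDIO) if (v, a) != (None, None)]

N_SCENES = 20
items = [(scene, v, a) for scene in range(1, N_SCENES + 1) for v, a in channels]

print(len(channels), len(items))  # 11 220
```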

The main standardization sample for the PONS 
consists of 492 public senior high school students 
from three locations (West Coast, Midwest, and 
East Coast) in the United States. These subjects 
were of average intelligence and from primarily 
middle-class families. Using the KR-20 formula, 
the authors computed internal consistency correla- 
tions for test results of these students. The results 
indicate generally adequate homogeneity for the in- 
dividual channels, with the blatant exception of the 
randomized-spliced voice channel (r = .06). Un- 
fortunately, the test-retest stability coefficients for 
PONS channel scores were found to be marginal at 
best (Table 14.12). Based upon 293 students 
retested after 10 days to 10 weeks, most of the test- 
retest correlations for individual channel scores 
were in the .20s and .30s. The only stability coeffi- 
cient that was even marginally passable was based 
on the entire scale (r = .69). 
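For readers who want to see how an internal consistency coefficient of this kind is obtained, the KR-20 formula for items scored right or wrong is KR-20 = [k / (k - 1)] × [1 - (Σ pq) / σ²], where k is the number of items, p is the proportion passing each item, q = 1 - p, and σ² is the variance of total scores. The following is a minimal sketch with made-up responses; it assumes the population form of the variance, and the function name `kr20` is our own.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of examinees, each a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)  # population variance of total scores (an assumption)
    if total_var == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_people  # proportion passing item i
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_var)


# Four hypothetical examinees answering three items (1 = correct, 0 = incorrect).
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(data))  # 0.75
```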

TABLE 14.12 Median PONS Test-Retest Reliability
Coefficients from Six Samples (N = 293)

                              Video Channel
Audio Channel          None   Face   Body   Figure   Total
None                    --    .24    .34    .24      [illegible]
Randomized-Spliced     .18    .43    .26    .20      .50
Content-Filtered       .27    .20    .24    .27      .50
Total                  .32    .49    .54    .51      .69

Source: Reprinted with permission from Rosenthal, R., Hall,
J. A., DiMatteo, M. R., Rogers, P. L., & Archer, D. (1979). Sensitivity
to nonverbal communication: The PONS test. Baltimore:
Johns Hopkins University Press.


The validity of the PONS has been investigated
in dozens of studies summarized by the authors
(Rosenthal et al., 1979). In general, correlations
with other indices of interpersonal sensitivity are
appropriately positive (support for criterion validity),
whereas correlations with cognitive variables
such as IQ are suitably low (support for discriminant
validity). In one interesting study, the effectiveness
of foreign service officers was found to correlate
.30 with scores on the audio version of the
PONS; that is, the more-effective officers were bet- 
ter than the less-effective officers at decoding non- 
verbal information contained in content-filtered and 
random-spliced voices. Recent studies have investi- 
gated the relationship between psychological disor- 
ders and performance on the PONS. In one study, 
higher Beck Depression Inventory scores in adults 
were associated with diminished accuracy on the 
PONS (Ambady & Gray, 2002). In another study, 
relatives of persons with schizophrenia scored sig- 
nificantly worse than controls on the PONS, despite 
comparable performance on other cognitive skills 
(Toomey, Seidman, Lyons, Faraone, & Tsuang, 
1999). We may conclude, then, that the PONS total 
score is a fair index of sensitivity to nonverbal com- 
munication. However, the individual channels of the 
PONS test possess such marginal reliability that 
they should be used for research purposes only. Hall 
(2001) provides a thorough review of the PONS as 
an index of interpersonal sensitivity. 


ECOLOGICAL MOMENTARY 
ASSESSMENT 


Recent advances in wireless connectivity have 
spawned an entirely new approach to assessment
known as ecological momentary assessment (EMA).
Ecological momentary assessment is defined as 
the “real-time measurement of patient experience in 
the real world, at the point of experience” (Shiffman, 
Hufford, & Paty, 2001). Consider the research prob- 
lem of determining whether a new drug treatment is 
effective in ameliorating the severe pain of migraine 
headaches. Whereas previous research methods re- 
lied upon retrospective questionnaire reports of pa- 
tients receiving a new drug treatment, an EMA 
approach instead would consist of patients reporting 
their instantaneous experiences on a handheld de- 
vice, with responses immediately transmitted (via 
the same wireless technology used by cell phones) 
to a central computer for ultimate analysis with so- 
phisticated software. For example, the handheld de- 
vice might “beep” to signal that the patient should 
immediately respond (on a touch-sensitive screen) 
to a series of rating scales for pain, mood, fatigue, 
and other relevant dimensions. The entire self-rating 
procedure might take less than a minute. The ratings 
would be requested several times a day on a ran- 
domized schedule. 
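One simple way to produce such a randomized schedule is to divide the waking day into equal blocks and draw one prompt time at random within each block, so that prompts are unpredictable but still spread across the day. The sketch below illustrates that idea only; the 9:00 to 21:00 window, the function names, and the blocking scheme are our assumptions, not a description of any particular EMA product.

```python
import random


def ema_schedule(n_prompts, start_min=9 * 60, end_min=21 * 60, seed=None):
    """Return prompt times in minutes after midnight, one per equal-width block."""
    if n_prompts < 1 or end_min <= start_min:
        raise ValueError("need at least one prompt and a non-empty time window")
    rng = random.Random(seed)
    block = (end_min - start_min) / n_prompts
    times = []
    for i in range(n_prompts):
        block_start = start_min + i * block
        # Draw uniformly within this block and round down to a whole minute.
        times.append(int(rng.uniform(block_start, block_start + block)))
    return times


def format_time(minutes):
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


# Five prompts for one hypothetical patient-day.
for t in ema_schedule(5, seed=1):
    print(format_time(t))
```

Fixing the seed makes a schedule reproducible for testing; in actual use a fresh schedule would be drawn for each patient and each day.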

Because EMA responses of clients are imme- 
diate and based on a schedule determined by the re- 
searcher, several biases of human recall are 
avoided. For example, consider the biasing effects
of saliency, in which emotionally charged events
dominate recall. A very brief episode
of severe migraine pain may be recalled as lasting
much longer than the actual experience because of
the emotional valence of the incident. Whereas a
retrospective questionnaire report of this pain
would be affected by the salience of the event, an
EMA analysis, with periodic real-time sampling of
the actual pain experiences, would provide a more
accurate portrayal of the episode. Recency is another
recall bias that is circumvented by EMA. The
recency bias refers to the fact that people are more 
likely to recall recent events than remote events. 
Potentially, this could lead to underestimation of 
the therapeutic effects of a drug if retrospective re- 
call coincided with the onset of symptoms. In con- 
trast, with an EMA analysis, client reporting 
consists of periodic and instantaneous time sam- 
ples; the results are relatively unaffected by the re- 
cency bias. 

In general, EMA provides a more accurate and 
reliable approach to the assessment of patient expe- 
rience than traditional approaches such as retro- 
spective questionnaires. One advantage is that 
compliance cannot be faked (as when patients fill 
out a week’s worth of daily questionnaires minutes 
before handing them in to the researcher). In fact, 
because EMA approaches are highly user-friendly, 
researchers report an astonishing overall compli- 
ance of 93 to 99 percent averaged across many stud- 
ies (Shiffman et al., 2001). EMA has been used in 
research into treatments for acute pain, alcoholism, 
arthritis, asthma, depression, eating disorders, head- 
aches, hypertension, gastrointestinal disorders, 
schizophrenia, smoking, and urinary incontinence 
(Shiffman & Hufford, 2001; Shiffman, Hufford, 
Hickcox, and others, 1997; Smyth, Wonderlich, 
Crosby, and others, 2001). As EMA technology be- 
comes streamlined and more affordable, we can ex- 
pect this new technique to become commonplace in 
psychological outcome studies with human clients. 


SUMMARY 


1. Behavioral assessment concentrates on 
behavior itself rather than on underlying traits, hy- 
pothetical causes, or presumed dimensions of per- 
sonality. Behavioral assessment is usually an 
integral part of behavior therapy designed to change 
the duration, frequency, or intensity of a well- 
defined target behavior. 


2. One assessment approach useful in exposure-based
methods of behavior therapy is the behavioral
avoidance test (BAT), in which the therapist
charts how long the client can tolerate the anxiety-inducing
stimulus. Fear survey schedules,
based upon self-ratings of commonly feared ob- 
jects and situations, are also useful, but there are 
reasons to question their validity. 

3. In contingency management, the therapist
attempts to modify maladaptive behavior by identifying
the reinforcing consequences and eliminating
them. Conversely, adaptive behaviors can be
strengthened by arranging for the delivery of re- 
wards when these behaviors occur, such as in a 
token economy. 


4. In cognitive behavior therapy, the therapist 
attempts to change the belief structure of the client. 
For example, Meichenbaum teaches clients to use 
coping self-statements (e.g., “You have a strategy . . . 
you can do it”) to combat stressful situations. 


5. An excellent index of depression—includ- 
ing the cognitive distortions—is the Beck Depres- 
sion Inventory (BDI), which consists of 21 quartets 
of hierarchically ordered statements, each scored 0 
to 3. The BDI is recognized as an excellent self-report
index of depression and is extensively validated
against external criteria.


6. Lewinsohn and his colleagues have pub- 
lished the Pleasant Events Schedule for self-moni- 
toring the frequency and pleasantness of up to 320 
largely ordinary, everyday behaviors. Depressed pa- 
tients who increase their self-monitored frequency 
of pleasant events generally show improved mood. 

7. Self-monitoring has many applications in 
psychology and medicine. One example is the 
Body Image Testing System (BITS) used to assess 
the distortions in body image found in persons with 
anorexia nervosa and other eating disorders. BITS 
provides the subject with a menu to change a com- 
puter-generated body image until it resembles her 
(or his) self-perception. 

8. A number of tools and methods exist for 
the assessment of nonverbal behavior such as visual
interaction. Unfortunately, the assessment of
visual interaction is beset with practical and
methodological pitfalls; the application of this approach
is restricted to research studies.

9. Paralinguistics refers to tone of voice, rate 
of speaking, and other nonverbal aspects of speech. 
An important area for paralinguistic assessment 
has been emotion judged from “content-filtered” 
speech. For example, paralinguistic anger in a 
physician’s voice (judged from content-filtered 
speech) predicts poor effectiveness in referring al- 
coholic patients for treatment. 


10. Facial expression is another nonverbal be- 
havior useful in assessment. The Facial Action 
Coding System (FACS) is a systematic approach to 
the coding of facial expression from slow-motion 
videotapes. The system consists of 44 action units 
(AUs) that account for all visible facial movement. 
With FACS, trained judges can produce highly re- 
liable ratings of facial actions. 


11. The Profile of Nonverbal Sensitivity
(PONS) is a comprehensive assessment of the abil- 
ity to perceive nonverbal cues in a series of 220 
two-second segments (film or videotape) of one 
female’s nonverbal behavior. Although the reliabil- 
ity of PONS channels is low, the global PONS 
score is a fair index of sensitivity to nonverbal com- 
munication. 

12. Ecological momentary assessment is de- 
fined as real-time measurement of patient experi- 
ence in the real world, at the point of experience. 
This is a relatively new approach to assessment that
relies upon wireless interconnectivity to circumvent
problems of retrospective report.


In the previous chapters the myriad ways that
tests are used in decision making were outlined.
Further, we have established that psychological
testing is not only pervasive, it is also consequential.
Test results matter. Test findings may warrant
a passage to privilege. Conversely, test findings
may sanction the denial of opportunity. For many
reasons, then, it is appropriate to close the book
with two special topics bearing upon the potential
repercussions of psychological testing. In Topic
15A, Computerized Assessment and the Future
of Testing, current applications of the computer in
psychological assessment are surveyed, and then
the professional and social issues raised by this
practice are discussed. This topic closes with
thoughts on the future of testing, which will be
forged in large measure by increasingly sophisticated
applications of computer technology. The
theme of professional issues is continued in Topic
15B, Ethical and Social Issues in Testing. We begin
with an overview and history of the computer in
testing.


COMPUTERS IN TESTING:
OVERVIEW AND HISTORY


Introduction to Computer-Aided Assessment 


In many counseling centers it is possible for a client 
to make an appointment with a microcomputer to 
explore career options. Other than a brief interac- 
tion with the receptionist to schedule time at the 
computer, the client need not interact with any 
other human being during the entire assessment 
process. The exact scenario will differ from one set- 
ting to the next, but might resemble the following. 
Instructions on the computer screen encourage the 
user to press any key. The computer then prompts 
the client to answer a series of questions about ac- 
tivities and interests by pressing designated nu- 
meric keys. After completion of the inventory, the 
computer calculates raw scores for a long list of oc- 
cupational scales and makes appropriate statistical 
transformations. Next, a brief report appears on the 
screen. The report provides a list of careers that best 
fit the interests of the client. A hard copy is also 
printed for later review. Presumably, the client is 
better informed about compatible career options 
and therefore more likely to choose a satisfying line 
of work. This scenario is a simple example of com- 
puter-assisted psychological assessment (CAPA), a 
recent development hailed by many psychologists 
but criticized by others. 

It is common knowledge that computers are 
now used widely in psychological testing. How- 
ever, the breadth of these applications might sur- 
prise the reader. In addition to straightforward 
applications such as presenting test questions, scor- 
ing test data, and printing test results (as described 
earlier), computers can be used to (1) design indi- 
vidualized tests based upon real-time feedback dur- 
ing testing, (2) interpret test results according to 
complex decision rules, (3) write lengthy and de- 
tailed narrative reports, and (4) present test stimuli 
in engaging and realistic formats, including high- 
definition video and virtual reality. We touch upon 
all of these topics in our review. The umbrella term 
computer-assisted psychological assessment 
(CAPA) refers to the entire range of computer applications
in psychological assessment. CAPA
holds great promise for the practice of psychology,
but also presents a variety of practical and ethical 
problems that demand careful and thoughtful con- 
sideration. A brief history of CAPA is a good back- 
drop to the discussion of practical and ethical 
concerns. 


Brief History of CAPA 


The scoring of psychological tests by hand is te- 
dious, time-consuming, and error-prone. Psychol- 
ogists therefore eagerly embraced technology in 
their quest to improve the efficiency and accuracy 
of testing. The use of mechanical scoring machines 
for psychological tests such as the Strong Voca- 
tional Interest Blanks (SVIB) first occurred in the 
1920s. In 1946, Elmer Hankes built an analog com- 
puter for automatic scoring and profiling of the 
SVIB (Moreland, 1992). By the early 1960s, the 
combination of optical scanners and mainframe 
computers provided quick, error-free scoring and 
profile printing of tests such as the SVIB and the 
MMPI. 

The use of computers to provide test interpre- 
tations—not just scores and profiles—can be traced 
to the Mayo Clinic in the early 1960s (Swenson, 
Rome, Pearson, & Brannick, 1965). The Mayo 
group needed a rapid and efficient system for 
screening thousands of medical patients for psy- 
chological problems with the MMPI. Patients an- 
swered the MMPI items on special IBM cards that 
could be read into the computer by a scanner. The
first interpretive system was crude by contempo- 
rary standards: 


A program was written that scored 14 MMPI 
scales, converted them into standard scores, and 
printed a series of descriptive statements. These 
statements were selected from a collection of 62 
statements, most of which were associated with the 
elevations on the MMPI scales. The program had 
some configural statements, but scale combina- 
tions, on which most of the literature is based, were 
largely ignored. (Fowler, 1985) 


Configural statements refer to interpretations based
upon specific patterns of scale scores, such as high-point
elevations on two designated scales. The success
of the Mayo system served as an impetus for
computer-based test interpretation of many other 
psychological tests. For example, in the early 1960s, 
Piotrowski (1964) developed a computer-based 
Rorschach interpretation system. The Rorschach 
system required considerable “pre-processing” by a 
technician. The individual responses were first 
coded according to a list of 320 parameters. The in- 
terpretation was based upon these parameter scores, 
not upon the raw responses. 
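The general logic of an early interpretive program of the Mayo type (score the scales, convert to standard scores, print statements tied to elevated scales, and add an occasional configural statement for a pattern of two scales) can be illustrated with a toy sketch. Everything specific below is hypothetical: the scale names "A", "B", and "C", the cutoff of 70, and the statement wording are invented for illustration and are not taken from the MMPI or from the Mayo program. The T-score conversion (mean 50, standard deviation 10) is the usual standard-score transformation.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, SD 10) using normative values."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


ELEVATION_CUTOFF = 70  # hypothetical cutoff for calling a scale "elevated"

# Hypothetical single-scale statements.
SCALE_STATEMENTS = {
    "A": "Statement associated with an elevation on scale A.",
    "B": "Statement associated with an elevation on scale B.",
    "C": "Statement associated with an elevation on scale C.",
}

# Hypothetical configural statement keyed to the two highest elevated scales.
CONFIGURAL_STATEMENTS = {
    frozenset({"A", "B"}): "Statement associated with joint elevation on A and B.",
}


def interpret(t_scores):
    """Return the descriptive statements triggered by a profile of T scores."""
    elevated = sorted(
        (scale for scale, t in t_scores.items() if t >= ELEVATION_CUTOFF),
        key=lambda scale: t_scores[scale],
        reverse=True,
    )
    statements = [SCALE_STATEMENTS[s] for s in elevated if s in SCALE_STATEMENTS]
    # Configural rule: look at the two highest elevated scales as a pair.
    if len(elevated) >= 2:
        pair = frozenset(elevated[:2])
        if pair in CONFIGURAL_STATEMENTS:
            statements.append(CONFIGURAL_STATEMENTS[pair])
    return statements


profile = {"A": 75.0, "B": 72.0, "C": 50.0}
for line in interpret(profile):
    print(line)
```

A system that relies mainly on the single-scale table and only rarely on the configural table would resemble the "crude" first-generation program described in the quotation.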

By the 1970s, psychologists realized that com- 
puters could be integrated into the entire process of 
psychological assessment. Johnson and Williams 
(1975) described the use of a mainframe computer 
with several remote terminals to assess an average 
of 17 psychiatric inpatient admissions per day. 
Typically, patients completed the following com- 
puter-administered tests: MMPI, Beck Depression 
Inventory, intelligence test, memory test, and on- 
line social history. A structured mental status exam 
was conducted by an interviewer and entered di- 
rectly into the computer. The computer scored the 
tests and generated a comprehensive narrative re- 
port. In a series of research studies, the Utah group 
demonstrated that these reports were generated in 
half the time and at half the cost of traditional eval- 
uations (Klingler, Miller, Johnson, & Williams, 
1977). 

By the 1980s, CAPA was so prevalent that vir- 
tually every psychological test in existence could 
be interpreted by computer. A detailed chronology 
of developments is beyond the scope of this text. 
We have summarized major historical landmarks in 
Table 15.1. 


COMPUTER-BASED TEST 
INTERPRETATION: CURRENT STATUS 


Computer-based test interpretation, or CBTI,
refers to test interpretation and report writing by
computer. Every major test publisher now offers
computer-based test interpretations. These services
are available by mail-in, online computer with
modem, or on-site microcomputer package. Moreover,
the market for computer-based testing and report
writing is so lucrative that we can anticipate
massive growth in this field for many years to
come. Butcher (1987, App. A) listed 169 vendors as
of 1986. Conoley, Plake, and Kemmerer (1991)
note that the number of computerized psychological
test interpretations had increased to more than
400 by 1990. New computerized test systems are
reported virtually every month in trade magazines
and newspapers (e.g., APA Monitor). Computer-based
test interpretation is here to stay.


TABLE 15.1 Historical Landmarks in CAPA

1946         Hankes develops an analog computer to score
             the SVIB (Moreland, 1992).
1954         Meehl’s (1954) book Clinical versus Statistical
             Prediction sets the stage for automated test
             interpretation.
Early 1960s  Optical scanner and digital computer are used to
             score SVIB and MMPI and also to print profiles
             (Moreland, 1992).
Early 1960s  First computer-based test interpretation system
             is developed for the MMPI at the Mayo Clinic
             (Swenson et al., 1965).
1964         Piotrowski publishes a system for computer-based
             interpretation of the Rorschach
             (Piotrowski, 1964).
1960s        Computer-based interpretive systems for the
             MMPI proliferate; Fowler, Finney, and Caldwell
             develop popular systems (Fowler, 1985).
1971         A mainframe computer with terminals is used to
             automate the entire assessment process for
             psychiatric inpatients at the VA Hospital in Salt
             Lake City, Utah (Klingler, Miller, Johnson, &
             Williams, 1977).
1975         First automated interpretation of a
             neuropsychological test battery (Adams & Heaton, 1985).
1979         Lachar publishes an actuarially based interpretive
             system for the Personality Inventory for
             Children (Lachar & Gdowski, 1979).
1970s        Computerized adaptive testing (CAT) is introduced;
             CAT allows for flexible, individualized
             test batteries which produce a given level of
             measurement accuracy with the fewest possible
             test items (Weiss, 1982).
1985         A special series on computerized psychological
             assessment appears in the Journal of Consulting
             and Clinical Psychology (Butcher, 1985).
1986         American Psychological Association publishes
             Guidelines for Computer-Based Tests and
             Interpretations.
1987         Publication of the first resource book titled
             Computerized Psychological Assessment: A
             Practitioner’s Guide (Butcher, 1987).
1994         Introduction of multimedia assessment batteries;
             for example, at IBM, a multimedia test is used to
             assess the real-life problem-solving skills of
             prospective employees (APA Monitor, June 1994).
1997         Educational Testing Service and other testing
             giants move to computerized testing for major
             admissions tests such as the Graduate
             Management Admission Test (GMAT) and
             Graduate Record Examinations (GRE).

In this section we will provide an overview of 
the types of computer-based test interpretations 
currently available. A comprehensive review of 
products could easily span several volumes, so the 
reader will have to settle for a discussion of diverse 
and representative examples of CBTI. We will ex- 
amine four approaches to CBTI: scoring reports, 
descriptive reports, actuarial reports, and computer- 
assisted clinical reports (Moreland, 1992). 


Scoring Reports 


Scoring reports consist of scores and/or profiles. In 
addition, a scoring report may include statistical 
significance tests and confidence intervals plotted 
for the test scores. By definition, scoring reports do 
not include narrative text or explanation of scores. 
Moreland (1992) discusses the appeal of scoring re- 
ports: 


These kinds of data make it possible to identify es- 
pecially meaningful scores and meaningful differ- 
ences among scores at a glance. They should also 
increase a user’s confidence that those scores are in 
fact important. Statistical significance tests are un- 
doubtedly superior to “clinical rules of thumb” 
when it comes to accurate interpretation of test 
scores. And who has time to hand calculate confi- 
dence intervals—especially for tests with dozens of 
scales? 


An example of a scoring report for the Jackson 
Vocational Interest Survey (Jackson, 1991) is 
shown in Figure 15.1. The reader will notice that a 
great deal of information is presented in an effi- 
cient, condensed manner. This is typical of scoring 
reports. In a single page, this hypothetical respon- 
dent would learn that his interests are highly simi- 
lar to majors in liberal arts, education, and business. 


In terms of occupational fit, he also learns that he 
is highly compatible with counselors, teachers, 
lawyers, administrators, and other professions with 
an emphasis upon human relations. 
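The confidence intervals and significance tests that a scoring report might print rest on simple formulas from classical test theory: the standard error of measurement is SEM = SD × √(1 - r), and the standard error of the difference between two scores on the same scale metric is SD × √(2 - r₁ - r₂), where r denotes scale reliability. The following sketch shows the arithmetic; the function names, the example reliabilities, and the use of z = 1.96 for a 95 percent interval are our own illustrative choices and do not describe any particular publisher's software.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


def confidence_interval(score, sd, reliability, z=1.96):
    """Interval of +/- z SEMs around an obtained score (95 percent by default)."""
    sem = standard_error_of_measurement(sd, reliability)
    return (score - z * sem, score + z * sem)


def scores_differ(score_a, score_b, sd, rel_a, rel_b, z=1.96):
    """True if two scores on the same metric differ by more than z standard errors."""
    se_diff = sd * sqrt(2 - rel_a - rel_b)
    return abs(score_a - score_b) > z * se_diff


# Hypothetical T-score scales (SD = 10) with reliabilities of .91.
low, high = confidence_interval(60, sd=10, reliability=0.91)
print(round(low, 2), round(high, 2))
print(scores_differ(60, 70, sd=10, rel_a=0.91, rel_b=0.91))
```

Repeating these calculations by hand for dozens of scales is exactly the chore that, as Moreland notes, few clinicians have time for.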


Descriptive Reports 


A descriptive report goes one step further than a 
scoring report by providing brief scale-by-scale in- 
terpretation of test results. Descriptive reports are es- 
pecially useful when test findings are conveyed to 
mental health professionals who have little knowl- 
edge of the test in question. For example, most clini- 
cal psychologists know that a high score on the 
MMPI Psychasthenia scale signifies worry and dis- 
satisfaction with social relationships—but other 
mental health practitioners may not have a clue as to 
the meaning of an elevation on this scale. A descrip- 
tive report can convey invaluable information in a 
half page or less. One of the first descriptive reports 
published is portrayed in Figure 15.2. The reader will
notice that the 20-year-old male patient is described 
as shy, sensitive, worried, and severely depressed. 
Referral of this medical patient to a psychologist or 
psychiatrist clearly is warranted. This report is a 
model of simplicity and clarity. By comparison, 
most contemporary computer-based descriptive re- 
ports provide excessive detail. Typically, the clini- 
cian must wade through several pages of narrative to 
extract essential features about the client. 


Actuarial Reports: Clinical versus 
Actuarial Prediction 


The actuarial approach to computer-based test inter- 
pretation is based upon the empirical determination 
of relationships between test results and the criteria 
of interest. The nature of this approach is best under- 
stood in the context of the longstanding debate on 
clinical versus actuarial prediction. A brief detour is 
needed here to introduce relevant concepts and is- 
sues before discussing actuarial reports. 

Many computer-based test interpretations make 
predictions about the test taker. These predictions 
are often disguised in the language of classification 
or diagnosis, but they are predictions nonetheless. 





RESPONDENT CASE, M. 123456789 MALE 27-MAR-84 PAGE 6 
SIMILARITY TO COLLEGE AND UNIVERSITY STUDENT GROUPS


FEMALES MALES 

AGRICULTURE -0.67 -0.57 (VERY LOW) 
ARTS & ARCHITECTURE -0.18 -0.08 (LOW) 
BUSINESS +0.53 +0.65 (VERY HIGH) 
EARTH AND MINERAL SCIENCE -0.72 (VERY LOW) 
EDUCATION +0.64 +0.65 (VERY HIGH) 
ENGINEERING -0.53 -0.61 (VERY LOW) 
HEALTH, PHYSICAL EDUC. & RECREATION -0.40 
HUMAN DEVELOPMENT +0.39 +0.70 (VERY HIGH) 
LIBERAL ARTS +0.71 +0.77 (VERY HIGH) 
SCIENCE -0.66 -0.60 (VERY LOW) 
NURSES +0.17 
MEDICAL STUDENTS +0.08 -0.12 (VERY LOW) 
TECHNICAL COLLEGE -0.43 (VERY LOW) 


SIMILARITY TO OCCUPATIONAL CLASSIFICATIONS 


BELOW ARE RANKED THE OCCUPATIONAL CLASSIFICATION FOUND TO BE SIMILAR TO YOUR 
INTEREST PROFILE. A POSITIVE SCORE INDICATES THAT YOUR PROFILE SHOWS SOME DEGREE 
OF SIMILARITY TO THOSE ALREADY WORKING IN THE OCCUPATIONAL CLUSTER, WHILE A 
NEGATIVE SCORE INDICATES DISSIMILARITY. 


SCORE  SIMILARITY       OCCUPATIONAL CLASSIFICATION

+0.78  VERY SIMILAR     COUNSELORS/STUDENT PERSONNEL WORKERS
+0.74  VERY SIMILAR     TEACHING AND RELATED OCCUPATIONS
+0.74  VERY SIMILAR     OCCUPATIONS IN RELIGION
+0.71  VERY SIMILAR     ADMINISTRATIVE AND RELATED OCCUPATIONS
+0.68  VERY SIMILAR     OCCUPATIONS IN LAW AND POLITICS
+0.60  VERY SIMILAR     PERSONNEL/HUMAN MANAGEMENT
+0.57  SIMILAR          OCCUPATIONS IN SOCIAL WELFARE
+0.55  SIMILAR          OCCUPATIONS IN SOCIAL SCIENCE
+0.55  SIMILAR          OCCUPATIONS IN PRE-SCHOOL & ELEMENTARY TEACHING
+0.50  SIMILAR          SALES OCCUPATIONS
+0.50  SIMILAR          OCCUPATIONS IN MERCHANDISING
+0.49  SIMILAR          CLERICAL SERVICES
+0.48  SIMILAR          OCCUPATIONS IN WRITING
+0.44  SIMILAR          OCCUPATIONS IN ACCOUNTING, BANKING AND FINANCE
+0.05  NEUTRAL          SERVICE OCCUPATIONS
+0.03  NEUTRAL          OCCUPATIONS IN MUSIC
+0.02  NEUTRAL          ASSEMBLY OCCUPATIONS-INSTRUMENTS & SMALL PRODUCTS
-0.30  NEUTRAL          OCCUPATIONS IN ENTERTAINMENT
-0.24  NEUTRAL          OCCUPATIONS IN COMMERCIAL ART
-0.35  DISSIMILAR       PROTECTIVE SERVICES OCCUPATIONS
-0.38  DISSIMILAR       AGRICULTURALISTS
-0.39  DISSIMILAR       MILITARY OFFICERS
-0.41  DISSIMILAR       OCCUPATIONS IN FINE ART
-0.58  DISSIMILAR       SPORT AND RECREATION OCCUPATIONS
-0.59  DISSIMILAR       MATHEMATICAL AND RELATED OCCUPATIONS
-0.62  VERY DISSIMILAR  MACHINING/MECHANICAL & RELATED OCCUPATIONS
-0.63  VERY DISSIMILAR  HEALTH SERVICE WORKERS
-0.65  VERY DISSIMILAR  OCCUPATIONS IN THE PHYSICAL SCIENCES
-0.65  VERY DISSIMILAR  CONSTRUCTION/SKILLED TRADES
-0.68  VERY DISSIMILAR  MEDICAL DIAGNOSIS AND TREATMENT OCCUPATIONS
-0.71  VERY DISSIMILAR  ENGINEERING & TECHNICAL SUPPORT WORKERS
-0.76  VERY DISSIMILAR  LIFE SCIENCES





FIGURE 15.1 A Scoring Report for the Jackson Vocational Interest Survey 
Source: Reprinted with permission from Jackson, D. N. (1991). Jackson Vocational Interest Survey Manual (3rd ed.). Port Huron, MI: 
Sigma Assessment Systems, Inc. (800) 265-1285. 
Sex: Male. Education: 20. Age: 34. Marital Status: Married. Outpatient. 

MMPI Code: 27”5’8064-391/ -KLF/ 

D   2  Severely depressed, worrying, indecisive, and pessimistic 
Pt  7  Rigid and meticulous. Worrisome and apprehensive. Dissatisfied with social 
       relationships. Probably very religious and moralistic. 
Mf  5  Probably sensitive and idealistic with high esthetic, cultural, and artistic interests 
Sc  8  Tends toward abstract interests such as science, philosophy, and religion 
Si  0  Probably retiring and shy in social situations 
Pa  6  Sensitive. Alive to opinions of others 
Pd  4  Independent or mildly nonconformist 
Hy  3 
Ma  9  Normal energy and activity level 
       Consider psychiatric evaluation 
Hs  1  Number of physical symptoms and concern about bodily functions fairly typical 
       for clinic patients 

FIGURE 15.2 Mayo Clinic MMPI Descriptive Report 
Source: Reprinted with permission from Dahlstrom, W. G., Welsh, G. S., & Dahlstrom, L. E. (1972). An 
MMPI handbook. Volume I: Clinical interpretation (rev. ed.). Minneapolis: University of Minnesota Press, 
p. 309. Copyright © 1960, 1972 by the University of Minnesota. 


For example, when a computer-based neuropsychological 
test report tentatively classifies a client 
as having brain damage, this is actually an implicit 
prediction that can be confirmed or disconfirmed 
by external criteria such as brain scans and 
neurological consultation. Likewise, when a 
computer-based MMPI-2 report provides a tentative DSM-IV 
diagnosis of a clinic referral, this is also a 
prediction that can be validated or invalidated by external 
criteria such as intensive clinical interview. A final 
example: When a computer-based CPI screening 
report for police candidates warns that an applicant 
will make a poor adjustment in law enforcement, 
this is also a prediction that could be proved correct 
or incorrect by an inspection of personnel records 
at a later date. 

The use of computers for test-based prediction 
highlights an essential distinction known as 
clinical versus actuarial judgment (Dawes, Faust, & 
Meehl, 1989; Garb, 1994; Meehl, 1954, 1965, 
1986). In clinical judgment, the decision maker 
processes information in his or her head to 
diagnose, classify, or predict behavior. An example: A 
clinical psychologist uses experience, intuition, and 
textbook knowledge to determine whether an 
MMPI profile indicates psychosis. Psychosis is a 
broad category that includes serious mental 
disorders often characterized by hallucinations, 
delusions, and disordered thinking. Thus, a clinician’s 
prediction of psychosis (or lack thereof) can be 
validated against external criteria such as detailed 
interview. 

In actuarial judgment, an empirically derived 
formula is used to diagnose, classify, or predict be- 
havior. An example: A clinical psychologist merely 
plugs scale scores into a research-based formula to 
determine whether an MMPI profile indicates psy- 
chosis. The actuarial prediction, too, can be vali- 
dated against appropriate external criteria. 

The essence of actuarial judgment is the care- 
ful development and subsequent use of an empiri- 
cally based formula for diagnosis, classification, or 
prediction of behavior. A common type of actuar- 
ial formula is the regression equation in which sub- 
test scores are combined in a weighted linear sum 
to predict a relevant criterion. But other statistical 
approaches may work well for decision making, 
too, including simple cutoff scores and rule-based 
flow charts. Of course, statistical rules lend them- 
selves to computer implementation, so it is fitting 
to discuss clinical versus actuarial judgment in this 
section on computer-based test interpretation. 
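
The two kinds of statistical rule just mentioned can be sketched in a few lines of Python. This is only an illustration: the subtest names, weights, intercept, and cutoff below are hypothetical and are not taken from any published formula.

```python
from typing import Mapping


def weighted_linear_sum(scores: Mapping[str, float],
                        weights: Mapping[str, float],
                        intercept: float = 0.0) -> float:
    """Regression-style prediction: intercept plus a weighted sum of subtest scores."""
    return intercept + sum(weights[name] * scores[name] for name in weights)


def exceeds_cutoff(value: float, cutoff: float) -> bool:
    """Simple cutoff rule: True when the value is above the cutoff."""
    return value > cutoff


# Hypothetical weights and scores, for illustration only.
weights = {"subtest_a": 0.5, "subtest_b": 0.3, "subtest_c": -0.2}
scores = {"subtest_a": 60, "subtest_b": 55, "subtest_c": 45}

predicted = weighted_linear_sum(scores, weights, intercept=10.0)
flagged = exceeds_cutoff(predicted, cutoff=45.0)
```

Either function is fully specified in advance, which is what makes a rule of this kind easy to automate.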

Although computers facilitate the use of the ac- 
tuarial method, we need to emphasize that “actuar- 
ial” and “computerized” are not synonymous. To be 
truly actuarial, test interpretations must be automatic 
(prespecified or routinized) and based on empirically 
established relations (Dawes, Faust, & Meehl, 
1989). If a computer program incorporates such automatic, 
empirically based decision-making rules, 
then it is making an actuarial prediction. Conversely, 
if a computer program embodies the thinking and 
judgment of a clinician—no matter how wise that 
person is—then it is making a clinical prediction. 

Meehl (1954) was the first to introduce the issue 
of clinical versus actuarial judgment to a broad 
range of social scientists. He stated the issue with 
pure simplicity: “When shall we use our heads 
instead of the formula?” Consider the practical prob- 
lem of distinguishing between neurosis and psy- 
chosis on the basis of MMPI results. Neurosis is an 
outdated (but still used) diagnostic term that refers 
to a milder form of mental disorder in which symp- 
toms of anxiety or dysphoria predominate. As noted 
previously, psychosis is a more serious form of men- 
tal disorder that may include hallucinations, delu- 
sions, and disordered thinking. The differential 
diagnosis between these two broad classes of men- 
tal disorder is important. Persons with neurosis often 
respond well to individual psychotherapy, whereas 
a patient with psychosis may need powerful an- 
tipsychotic medications that produce adverse side 
effects. Which is superior for MMPI-based diag- 
nostic decision making, the head of the well-trained 
psychologist or an appropriate formula based upon 
prior research? We return to this issue later. 

Meehl (1954) specified two conditions for a fair 
comparison of these contrasting approaches to de- 
cision making. First, both methods should base 
judgments on the same data. For example, in com- 
paring the experienced clinician against an actuar- 
ial equation, both approaches should prognosticate 
from the same pool of MMPI profiles and only 
those profiles. Second, we must avoid conditions 
that can artificially inflate the accuracy of the actu- 
arial approach. For example, the actuarial equation 
should be derived on an initial sample, prior to the 
comparison with clinical decision making on a new 
sample of MMPI profiles. Otherwise, the actuarial 
decision rules will capitalize upon chance relations 
among variables and produce a spuriously high rate 
of correct decisions. 
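
The second condition can be illustrated with a small Python sketch. The scores and criterion labels below are invented for the illustration; the point is only that a rule tuned on one sample should be scored on a different one.

```python
from typing import List, Tuple

Case = Tuple[int, bool]  # (test score, True if the criterion is present)


def accuracy(cutoff: int, cases: List[Case]) -> float:
    """Proportion of cases for which 'score > cutoff' matches the criterion."""
    hits = sum((score > cutoff) == present for score, present in cases)
    return hits / len(cases)


def derive_cutoff(cases: List[Case]) -> int:
    """Choose the cutoff that maximizes accuracy on the derivation sample."""
    candidates = sorted({score for score, _ in cases})
    return max(candidates, key=lambda c: accuracy(c, cases))


# Hypothetical data, for illustration only.
derivation_sample = [(40, False), (45, False), (52, True),
                     (55, False), (58, True), (63, True)]
new_sample = [(42, False), (50, True), (54, False),
              (57, False), (61, True), (66, True)]

cutoff = derive_cutoff(derivation_sample)
derivation_accuracy = accuracy(cutoff, derivation_sample)
cross_validated_accuracy = accuracy(cutoff, new_sample)
```

In this toy example the rule does better on the sample that produced it than on the new sample, which is why the fair comparison uses the cross-validated figure.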


When the conditions are met for a fair test of 
clinical versus actuarial decision making, the latter 
method is superior in the vast majority of cases. 
The actuarial approach is clearly better for the task 
cited previously—differential diagnosis of neuro- 
sis or psychosis from the MMPI. L. R. Goldberg 
(1965) determined that a simple linear sum of se- 
lected MMPI scale scores resulted in 70 percent 
correct classifications, whereas Ph.D. psycholo- 
gists averaged only 62 percent, with the single best 
psychologist achieving 67 percent correct deci- 
sions. The decision rule that defeated all human 
contenders was: if the T-score sum on L + Pa + 
Sc - Hy - Pt exceeds 44, diagnose psychosis; otherwise, 
diagnose neurosis.¹ 
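
Because the rule is fully specified, it can be written out directly. The following Python sketch implements the rule exactly as stated in the text; the function name and the dictionary layout are illustrative choices, and the sample T scores are hypothetical.

```python
from typing import Mapping


def goldberg_rule(t_scores: Mapping[str, int]) -> str:
    """Apply the rule from the text: L + Pa + Sc - Hy - Pt, with a cutoff of 44."""
    index = (t_scores["L"] + t_scores["Pa"] + t_scores["Sc"]
             - t_scores["Hy"] - t_scores["Pt"])
    return "psychosis" if index > 44 else "neurosis"


# Hypothetical profiles, for illustration only.
profile_a = {"L": 50, "Pa": 70, "Sc": 75, "Hy": 60, "Pt": 65}  # index = 70
profile_b = {"L": 50, "Pa": 50, "Sc": 50, "Hy": 60, "Pt": 65}  # index = 25

decision_a = goldberg_rule(profile_a)
decision_b = goldberg_rule(profile_b)
```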

Dawes, Faust, and Meehl (1989) cited nearly 
100 comparative studies in the social sciences. In 
almost every case, the actuarial method equaled or 
surpassed the clinical method, sometimes substan- 
tially. The research by Leli and Filskov (1984) is 
typical in this regard. They studied the diagnosis 
of progressive brain dysfunction based upon neu- 
ropsychological testing. An actuarial decision rule 
derived from one set of cases was applied to a 
new sample with 83 percent correct identification. 
Working from precisely the same test data, groups 
of inexperienced and experienced clinicians correctly 
identified only 63 percent and 58 percent of 
the new cases, respectively. The reader will notice 
the disturbing and embarrassing fact that experi- 
ence did not improve hit rates for this clinical 
decision-making task. 

A recent meta-analysis of 136 studies by 
Grove, Zald, Lebow, Snitz, and Nelson (2000) 
provides additional support for the superiority of 
actuarial prediction over clinical prediction. These 
researchers analyzed diverse studies in the fields of 
medicine, education, and clinical psychology in 
which practitioners predicted such outcomes as 
academic performance, job success, medical diag- 
nosis, psychiatric diagnosis, criminal recidivism, 
and suicide. In each study, the clinical predictions 
of the practitioners (physicians, professors, and 


1. Respectively, the full names for these scales are L (validity 
scale), Paranoia, Schizophrenia, Hysteria, and Psychasthenia. 
psychologists) were compared to the actuarial 
predictions derived from empirically based statisti- 
cal formulas. Although the researchers found a few 
scattered instances in which the clinical method 
was notably more accurate than the statistical 
method, on the whole, their survey confirmed prior 
findings on this topic. The authors conclude: 


Even though outlier studies can be found, we 
identified no systematic exceptions to the general 
superiority (or at least material equivalence) of 
mechanical prediction. It holds in general medi- 
cine, in mental health, in personality, and in edu- 
cation and training settings. It holds for medically 
trained judges and for psychologists. It holds for 
inexperienced and seasoned judges. (p. 25) 


Perhaps the most disturbing conclusion of these 
researchers was that the availability of clinical in- 
terview actually detracted from the accuracy of 
practitioner predictions in the diverse fields studied. 
Compared to the empirically based statistical pre- 
dictions, the clinical predictions were outperformed 
by an even greater margin when information from 
clinical interview was available to the practitioners. 
The reasons for this are unclear, but likely include 
the susceptibility of humans to certain cognitive bi- 
ases (e.g., paying too much attention to vivid inter- 
view information). Also, clinicians typically do not 
receive adequate feedback as to the accuracy of their 
judgments, and hence have no basis for correcting 
maladaptive predictions. 

The lesson to be learned from this literature is 
that computerized narrative test reports should in- 
corporate actuarial methods, when possible. For 
example, computer-generated reports should use 
existing actuarial formulas to determine the likeli- 
hood of various psychiatric diagnoses, rather than 
relying upon the programmed logic of a master 
clinician. Unfortunately, as the reader will discover 
in the following, most computerized narrative test 
reports are clinically based—which raises concerns 
about their validity. 


Actuarial Interpretation: Sample Approach 


Sines (1966) defined an actuarial interpretation as 
based upon “the empirical determination of the 
regularities that may exist between specified 
psychological test data and equally clearly specified 
socially, clinically, or theoretically significant nontest 
characteristics of the person tested.” In other words, 
the statements in an actuarial interpretation are not 
derived from conjecture or clinical lore; they are 
based upon specific, quantified research findings. 

Because of the investigative effort required, ac- 
tuarial approaches to computer-based test interpre- 
tation are rare. The first actuarial systems were 
based upon the MMPI (e.g., Gilberstadt & Duker, 
1965; Marks & Seeman, 1963). More recently, ac- 
tuarial interpretive systems have been applied to the 
Personality Inventory for Children (Lachar & 
Gdowski, 1979; Lachar, 1987), the Marital Satis- 
faction Inventory (Snyder, Lachar, & Wills, 1988), 
and the California Psychological Inventory (Gough, 
1987). A specific example will clarify this method. 

The developers of the Personality Inventory for 
Children (PIC) produced an exemplary system for 
computer-based actuarial test interpretation, which 
we will describe for illustrative purposes. The 
reader will recall from the previous chapter that the 
PIC, now updated as the PIC-2, is a true-false in- 
ventory that the parent or caregiver completes with 
respect to the child’s behavior. Based upon these 
responses, a profile of T scores (mean of 50, SD of 
10) is produced for four validity scales (e.g., De- 
fensiveness), 12 clinical scales (e.g., Delinquency), 
and four factor scales (e.g., Social Incompetence). 
In total, T scores are reported for 20 scales on the 
PIC. Of course, higher T scores indicate a greater 
likelihood of psychopathology. 
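
As a reminder of how a T score relates to a raw score, the standard conversion is T = 50 + 10z, where z is the raw score expressed in standard deviation units relative to the normative group. The Python sketch below shows the arithmetic; the normative mean and standard deviation in the example are hypothetical and are not PIC norms.

```python
def t_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Convert a raw score to a T score (mean of 50, SD of 10)."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd  # standard score relative to the norm group
    return 50 + 10 * z


# Hypothetical norms: mean raw score of 20, SD of 5.
example = t_score(raw=30, norm_mean=20, norm_sd=5)  # two SDs above the mean
```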

Actuarial interpretation of the PIC rests upon 
the empirically derived correlations between indi- 
vidual scales and important nontest criteria. Re- 
search subjects for the Lachar and Gdowski (1979) 
study consisted of 431 children referred to a busy 
teaching clinic. As part of the evaluation process 
for each child, the staff members, parents, and 
teachers completed a comprehensive questionnaire, 
which listed 322 descriptive statements concerning 
behavior and other variables. In addition, parents 
or caretakers filled out the PIC. 

In the first phase of the actuarial study, the 322 
descriptive statements were correlated with the 20 
PIC scales to identify significant scale correlates. In 
the second phase, the significant correlates were 
analyzed further to determine the relationship between 
descriptive statements and T-score ranges on the PIC 
scales. The outcome of this prodigious effort was a 
series of actuarial tables not unlike the tables used by 
insurance companies to predict the likelihood of ill- 
ness, death, accidents, and the like, based upon pop- 
ulation demographics such as age, sex, and residence. 
Some examples of actuarial correlates of the Delin- 
quency, or DLQ, Scale are depicted in Table 15.2. 

Actuarial tables capture a wealth of information 
useful in clinical practice. Consider two hypothet- 
ical 12-year-old children, Jimmy and Johnny, each 
referred to a clinician with the same presenting 
problem: school underachievement. As part of the 
intake procedure, the clinician asks each mother to 
fill out the PIC. Suppose that the Delinquency, or 
DLQ, Scale score for Jimmy is highly elevated at a 
T score of 114, whereas Johnny obtains an average 
range T score of 54. Based upon these scores, the 
clinician would know the likelihood—listed here as 
percentages—that certain behavioral descriptions 
apply to each child: 


                        Jimmy        Johnny 
                        (DLQ=114)    (DLQ=54) 
Refuses to go to bed    42%          18% 
Lies                    90%          44% 
Uses drugs              32%          0% 
Rejects school          56%          16% 
Involved with police    58%          0% 


The reader will immediately recognize that Jimmy 
fits a pattern of pervasive conduct disorder, whereas 
Johnny appears to have few such behavior prob- 
lems. In Jimmy’s case, the underachievement is 
most likely secondary to a pattern of antisocial be- 
havior, whereas for Johnny the clinician must look 
elsewhere to understand the school failure. Of 
course, this is only a small fraction of the informa- 
tion that would be available from a computer-based 
actuarial interpretation of the PIC. In a full report, 
the clinician would receive statistics and narrative 
statements pertinent to all 20 scales from the PIC. 
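
At bottom, a computer-based actuarial interpretation of this kind is a table lookup. The Python sketch below encodes only the two T-score ranges and the percentages quoted in the Jimmy and Johnny example; the data structure and function name are illustrative assumptions, not the format of the published PIC software.

```python
from typing import Dict, Optional, Tuple

# Occurrence rates (percent) for the two DLQ ranges used in the example above.
DLQ_RATES: Dict[Tuple[int, int], Dict[str, int]] = {
    (30, 59): {"Refuses to go to bed": 18, "Lies": 44, "Uses drugs": 0,
               "Rejects school": 16, "Involved with police": 0},
    (110, 119): {"Refuses to go to bed": 42, "Lies": 90, "Uses drugs": 32,
                 "Rejects school": 56, "Involved with police": 58},
}


def dlq_correlates(t_score: int) -> Optional[Dict[str, int]]:
    """Return the occurrence rates for the range containing t_score, if encoded."""
    for (low, high), rates in DLQ_RATES.items():
        if low <= t_score <= high:
            return rates
    return None  # range not included in this partial sketch


jimmy = dlq_correlates(114)
johnny = dlq_correlates(54)
```

A complete system would hold one such table for every scale and every T-score range.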


Computer-Assisted Clinical Reports 


In a computer-assisted clinical report, the interpre- 
tive statements assigned to test results are based 
upon the judgment of one or more expert clinicians. 
The expert clinicians formalize their thought 
processes and develop automated decision rules 
that are then translated into computer code. This 
method differs crucially from the computer-assisted 
actuarial approach in which interpretive statements 
are based strictly upon formal research findings. 
Superficially, the two approaches may appear to be 
identical insofar as each is rule-based and auto- 
mated. The difference has to do with the origin of 
the rules: empirical research (actuarial approach) 
versus clinician judgment (clinical approach). 
TABLE 15.2 Occurrence Rates for Actuarial Descriptors of the PIC Delinquency Scale 

T-Score Ranges 

Descriptor Base Rate* 30-59 60-69 80-89 90-99 100-109 110-119 >120 
Refuses to go to bed 30 26 33 36 33 42 38 
Lies 2123662 36 73 71 79 90 91 
Uses drugs 12 2 7 11 18 32 53 
Rejects school 40 26 42 50 47 56 67 
Involved with police 17 4 10 21 19 58 63 

*Percentage of all children rated as displaying the characteristic. 

Note: These five descriptors are merely a representative sample of the 51 actuarial correlates of the Delinquency Scale. 

Source: Material from Actuarial Assessment of Child and Adolescent Personality: An Interpretive Guide for the Personality Inventory for 
Children Profile copyright © 1979 by Western Psychological Services. Reprinted by permission of the publisher, Western Psychological 
Services, 12031 Wilshire Boulevard, Los Angeles, CA 90025, United States of America. 


Even though clinicians generally recognize the 
superiority of the actuarial method, there is one 
significant advantage to the computer-assisted 
clinical approach. The advantage is that the clinical 
approach can be designed to interpret all test profiles, 
whereas some test profiles will be uninterpretable by 
means of an actuarial approach. The discouraging 
truth about actuarial “cookbook” systems for test in- 
terpretation is that the classification rate usually 
plummets when a system is used in a new setting. 
The classification rate refers to the percentage of test 
results that fit the complex profile classification rules 
necessary for actuarial interpretation. For example, 
in the Gilberstadt and Duker (1965) actuarial MMPI 
system, the 1-2-3 code type is defined by these rules 
for the Hs (Hypochondriasis), D (Depression), Hy 
(Hysteria), and L, F, K (validity) scales: 


1. Hs, D, and Hy over T score 70 

2. Hs > D > Hy 

3. No other scales over T score 70 

4. L < T score 66, F < T score 86, and K < T score 71 
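
The four rules above translate naturally into a short program. The Python sketch below is one possible encoding; it assumes that "other scales" in rule 3 means the remaining standard MMPI clinical scales, and the sample profile is hypothetical.

```python
from typing import Mapping

# Assumption: "other scales" in rule 3 refers to the remaining clinical scales.
OTHER_CLINICAL_SCALES = ("Pd", "Mf", "Pa", "Pt", "Sc", "Ma", "Si")


def is_123_code_type(t: Mapping[str, int]) -> bool:
    """Check a profile of T scores against the four 1-2-3 code type rules."""
    rule1 = t["Hs"] > 70 and t["D"] > 70 and t["Hy"] > 70
    rule2 = t["Hs"] > t["D"] > t["Hy"]
    rule3 = all(t[scale] <= 70 for scale in OTHER_CLINICAL_SCALES)
    rule4 = t["L"] < 66 and t["F"] < 86 and t["K"] < 71
    return rule1 and rule2 and rule3 and rule4


# Hypothetical profile, for illustration only.
profile = {"Hs": 85, "D": 80, "Hy": 75,
           "Pd": 60, "Mf": 55, "Pa": 58, "Pt": 65, "Sc": 62, "Ma": 50, "Si": 57,
           "L": 50, "F": 60, "K": 55}

fits = is_123_code_type(profile)
```

A profile that misses any one rule, for instance because D exceeds Hs, is simply left unclassified by this code type.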


Persons who produce this kind of MMPI profile 
often suffer from psychophysiological overreactiv- 
ity, not to mention a host of other empirically con- 
firmed characteristics. Of course, there are several 
additional code types, each defined by a set of com- 
plex decision rules, and each accompanied by an 
elaborate, actuarially based description of personal- 
ity and psychopathology. A typical finding is that a 
computer-assisted actuarial system developed within 
one client population will be capable of interpreting 
up to 85 percent of the test profiles encountered in 
that setting. However, when the actuarial system is 
applied to a new client population, perhaps 50 per- 
cent of the test profiles will fit the decision rules. 
This means that about half of the test profiles do not 
fit the rules. At best, these clients will receive a su- 
perficial, scale-by-scale interpretation rather than a 
more sophisticated actuarial interpretation based 
upon code types. The problem of shrinkage in clas- 
sification rate is observed in virtually all studies of 
actuarial interpretation (Moreland, 1992). 
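
The classification rate defined above is easy to compute once the code-type rules are expressed as functions. In the Python sketch below, the two rules and the four profiles are toy stand-ins invented for the illustration; real code-type rules are far more elaborate.

```python
from typing import Callable, Dict, List

Profile = Dict[str, int]
Rule = Callable[[Profile], bool]


def classification_rate(profiles: List[Profile], rules: List[Rule]) -> float:
    """Percentage of profiles that satisfy at least one code-type rule."""
    if not profiles:
        return 0.0
    classified = sum(1 for p in profiles if any(rule(p) for rule in rules))
    return 100.0 * classified / len(profiles)


# Toy rules and profiles, for illustration only.
def elevated_d(p: Profile) -> bool:
    return p["D"] > 70


def elevated_pt(p: Profile) -> bool:
    return p["Pt"] > 70


sample = [{"D": 75, "Pt": 60}, {"D": 55, "Pt": 72},
          {"D": 50, "Pt": 50}, {"D": 65, "Pt": 68}]

rate = classification_rate(sample, [elevated_d, elevated_pt])
```

Running the same rules over profiles from a new setting and comparing the two rates is how the shrinkage described above would show up.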
Computer-assisted clinical reports tend to be 
lengthy and detailed, full of scale scores, item in- 
dices, and graphs. Of course, these reports also in- 
clude several pages of narrative report, usually 
phrased in terms of hypotheses as opposed to con- 


firmed findings. The shortest such report is about 
six pages (e.g., the Karson Clinical Report for the 
16 PF), whereas longer ones can run to 10 or 20 
pages (e.g., MMPI-2 interpretations). 


HIGH-DEFINITION VIDEO 
AND VIRTUAL REALITY: 
THE NEW HORIZONS OF CAPA 


With recent improvements in technology, the mod- 
ern microcomputer has opened up a whole new 
world for psychological assessment. The ordinary 
personal computer is now capable of presenting 
video segments that possess the visual clarity of 
television. Built-in stereo sound systems produce 
exquisite audio output, including synthesized 
human speech that passes for the real thing. With 
CD-ROM accessories, instantaneous access to 
huge repositories of information—including still 
images, live video segments, music, tables, charts, 
animation—is possible. Collectively, these capaci- 
ties are known as multimedia—especially when 
used for interactive and educational applications. 
Multimedia is one new horizon of computer- 
assisted psychological assessment. Another is vir- 
tual reality. 

At IBM, researchers have been developing the 
Workplace Situations test to assess job applicants 
for manufacturing positions (Drasgow, Olson- 
Buchanan, & Moberg, 1999). What is unique about 
the test is the nature of the stimuli. Rather than 
merely describing work situations, the test displays 
computer-driven interactive video of realistic work 
scenes. The assessment consists of 30 short scenes 
in a fictional organization named Quintronics. The 
scenes depict work-related interpersonal episodes 
arising in the manufacture of hypothetical elec- 
tronic products called quintelles and alpha pinhole 
boards. The computer vignettes depict such con- 
cerns as excessive workloads, poor training, inter- 
personal conflict, poor productivity, and flawed 
work. Each scene is presented and then the screen 
pauses with a description of five ways of respond- 
ing to the workplace problem. The scenes have a 
highly realistic feeling to them, which enhances the 
face validity of the test. This kind of interactive 
video test likely provides a more accurate assessment 
than paper-and-pencil tests of how people 
would actually respond on the job. Tests that use in- 
teractive video are especially good at tapping ex- 
aminees’ abilities to deal with complex, real-life 
problems, such as decision making under time 
pressure or conflict resolution in the work place. 
Olson-Buchanan et al. (1998) have developed 
an interactive video test of conflict resolution that 
reveals both the promise and the perils of this new 
technology. Their instrument, the Conflict Resolu- 
tion Skills Assessment (CRSA), consists of nine 
conflict scenes, each with the potential for multiple 
branchings, depending upon the examinee’s ongo- 
ing response pattern: 
A typical item on the Conflict Resolution Skills 
Assessment begins by presenting a conflict scene 
(1-3 minutes in duration) to an individual. At a 
critical point the scene is stopped and four options 
for addressing the conflict are provided; the as- 
sessee is asked to choose the option that best de- 
scribes what he or she would do in this situation. 
Depending on the option chosen, the computer 
branches to an extension of the first scene depict- 
ing how events might unfold. Again, the conflict 
escalates, the scene is frozen, four options for ad- 
dressing the conflict are presented, and the assessee 
decides which option would best resolve the con- 
flict. The computer then branches to an entirely 
new conflict scene. (p. 180) 


The perils of this effort include the increased ex- 
pense required for test development (e.g., cost of 
producing high-quality, convincing videos) as well 
as daunting theoretical issues (e.g., challenge of con- 
ceptualizing “good” conflict resolution skills). This 
kind of interactive, branching, video-based test also 
poses unique psychometric problems. For example, 
how do you assess the reliability of specific subele- 
ments of the test when only a few of the examinees 
may have taken that “route” through the test? 
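
The psychometric problem can be made concrete with a short Python sketch of one branching item as described in the quotation: four options, a follow-up scene that depends on the first choice, and four more options. The option labels and the response data are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple

OPTIONS = ("A", "B", "C", "D")  # hypothetical labels for the four options


def possible_routes() -> List[Tuple[str, str]]:
    """All (first choice, second choice) routes through one branching item."""
    return list(product(OPTIONS, repeat=2))


def examinees_per_followup(responses: List[Tuple[str, str]]) -> Counter:
    """Count how many examinees saw each follow-up scene (set by the first choice)."""
    return Counter(first for first, _ in responses)


# Hypothetical responses from ten examinees.
responses = [("A", "B"), ("A", "A"), ("A", "C"), ("A", "B"), ("B", "D"),
             ("A", "A"), ("B", "B"), ("A", "D"), ("C", "A"), ("A", "B")]

n_routes = len(possible_routes())
counts = examinees_per_followup(responses)
```

With sixteen possible routes and only ten examinees, some follow-up scenes are seen by one or two people or by no one at all, which leaves little data for estimating the reliability of those subelements.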

In spite of these challenges, the development of 
pathbreaking instruments such as the CRSA is well 
worth the effort. Consider one important payoff, 
namely, scores on the CRSA show essentially no 
correlation with general cognitive ability (Drasgow 
et al., 1999). Psychologists long have suspected 
that social skills are distinct from cognitive skills, 
but when both are assessed with traditional paper- 
and-pencil instruments, moderate to strong corre- 


lations are the rule. Most likely, this is because of 
shared method variance, namely, verbal test-taking 
skills help an examinee navigate any paper-and- 
pencil test, regardless of the construct being mea- 
sured. By using interactive video as the primary test 
stimulus, instruments such as the CRSA provide a 
purer measure of social skills than paper-and-pencil 
tests. This unique instrument illustrates that 
social skills contribute something different than 
cognitive skills to effective work performance. 

Another potential application of multimedia is 
in personnel screening for entry-level police offi- 
cers. Law enforcement personnel must have good 
observational and evaluative skills, which can be 
assessed realistically with video stimuli. For ex- 
ample, an assessment might consist, in part, of a 
videotape of witnesses at a crime scene. Police can- 
didates might be asked to determine the truth of the 
witnesses and to draw conclusions about the crime 
based upon their observational powers (APA Mon- 
itor, June 1994). This example—currently hypo- 
thetical—illustrates the potential for multimedia to 
revolutionize psychological assessment. 

It is worth noting that multimedia tests can be 
virtually free of reading and writing requirements 
on the part of the examinee. Talented job candidates 
who do not possess good reading or writing abili- 
ties but who do have practical job skills can be iden- 
tified by means of multimedia tests. For some jobs, 
multimedia might be fairer than the paper-and- 
pencil approach. 

Finally, a very recent high-tech approach to 
computer-based assessment deserves brief men- 
tion. In virtual reality, the participant wears a pair 
of goggles that transmit realistic, three-dimensional 
images of a simulated environment. By manipulat- 
ing simple control devices, the participant can 
navigate through the environment even though 
standing still. Of course, the visual environment, 
known as a virtual reality, is based on sophisti- 
cated computerized output. 

New assessment tools that utilize virtual reality 
(VR) are in their infancy, but show great promise. 
For example, Kesztyues, Mehlitz, Schilken, and 
others (2000) describe a VR system for the assess- 
ment of spatial orientation disorders in neuro- 
logical referrals. They compared the traditional 
head-mounted display with a wall projection 
system as patients “navigated” their way through 
virtual environments such as a park or a maze. This 
assessment system holds promise, although the re- 
searchers did encounter unexpected problems such 
as some patients experiencing nausea when using 
the head-mounted display. Elkind, Rubin, Rosen- 
thal, Skoff, and Prather (2001) describe a promis- 
ing VR test of real-life skills required for safe, 
independent living. Numerous innovative tests 
based upon VR can be found in the new journal Cy- 
berPsychology and Behavior. Riva (1997) has as- 
sembled relevant articles on the promise and 
pitfalls of VR in psychological assessment. 


EVALUATION OF COMPUTER-BASED 
TEST INTERPRETATION 


Computerized testing has clear advantages but also 
some potentially serious disadvantages in com- 
parison to the traditional clinical approach to psy- 
chological testing. We offer a brief survey here, 
stressing both the advantages and disadvantages of 
computer-based testing, diagnosis, and report writ- 
ing. More detail on this topic can be found in 
Butcher (1987), Moreland (1992), Roid and John- 
son (1998), Butcher, Perry, and Atlis (2000), and 
Mills, Potenza, Fremer, and Ward (2002). 


Advantages of Computerized Testing 
and Report Writing 


The main advantages of computer-based testing are 
quick turnaround, inexpensive cost, near-perfect 
reliability, and complete objectivity. In addition, 
some measurement applications such as flexible 
adaptive testing virtually require the use of com- 
puters for their implementation. We explore these 
points in more detail later. 

In a busy clinical practice, delays between test- 
ing and submission of the consulting report are 
common, almost inevitable. These delays not only 
tarnish the reputation of the consultant, they may 
also adversely affect the treatment outcome for the 
client. For example, a college student with learning 
disabilities may need immediate intervention in 
order to avert an academic disaster. A delay of two 
or three weeks in submission of a consulting report 
could spell, indirectly, the difference between 
failure and success in academic performance. Com- 
puter-based reports can speed up the entire consul- 
tation process. Many software systems produce 
reports that can be transferred into a standard word- 
processing program for immediate customized edit- 
ing, thereby speeding up the turnaround time (e.g., 
Psychological Corporation, 1994; Tanner, 1992). 

Cost is another consideration in computer- 
based testing. Although there are no definitive 
studies on this topic, most authorities assert that 
computer-scored and interpreted psychological 
tests cost considerably less than those produced en- 
tirely by clinician effort (Butcher, 1987). In their 
studies of automated testing at the Salt Lake City 
VA Hospital, Klingler, Miller, Johnson, and 
Williams (1977) concluded that the computer cut 
the cost of testing in half. Certainly as the comput- 
erized testing programs become more sophisticated 
and are used by larger numbers of clinicians, the 
cost per consultation will plummet. 

Reliability and objectivity are the hallmarks of 
the computer. Assuming that the software is accu- 
rate and error-free, computers simply do not make 
clerical scoring errors, nor do they vary their meth- 
ods of stimulus presentation from one day to the 
next, nor do they yield different narrative reports 
based on the same input. The product is the same no 
matter how many times the computer program is 
used. Furthermore, because computerized reports 
are based on objective rules, they are not distorted 
by halo effects or other subjective biases that might 
enter into a clinically derived report. Butcher (1987) 
asserts that computerized reports could have special 
significance in court cases, because they would be 
viewed as “untouched by human hands.” This is an 
intriguing possibility, but perhaps somewhat overly 
optimistic. Lawyers and judges will still want to 
know who programmed the software, how the nar- 
rative statements were developed, and so on. 


Disadvantages of Computerized Testing 
and Report Writing 


Consider the following illustration, hypothetical 
yet realistic and probably not a rare occurrence. A 
hospital physician refers a difficult medical patient
to the psychology service for a personality evalua- 
tion. The patient is escorted to the testing center 
where a receptionist seats him at a table in front of 
a microcomputer. Instructions appear on the com- 
puter monitor to answer a series of self-statements 
true or false by pressing the T or F key. The patient 
completes the computerized objective personality 
inventory and is escorted back to the medical ser- 
vice. Seconds later, a narrative report based on the 
patient’s responses emerges from the printer. The 
consulting psychologist peruses the report briefly, 
then sends it (unsigned) through departmental mail 
to the physician. The report is handsome, ever so 
crisp in its laser-printed appearance, with a graphic 
summary of scales on the cover page. Furthermore, 
the narrative is valid-sounding and reads as if it 
were copyedited by a professional writer (in fact, it 
was). The physician is impressed and takes the re- 
port to heart, making treatment decisions based on 
the personality evaluation. 

This scenario illustrates an essential quandary 
with computer-based testing and report writing: 
Computers can so dominate the testing process that 
the clinical psychologist is demoted to a mere 
clerk—or is removed from the assessment loop en- 
tirely. Although most psychologists acknowledge 
that computers are a welcome addition to the prac- 
tice of psychological testing, critics have raised a 
number of disquieting concerns about recent as- 
sessment practices such as those depicted here. 
Computerization of the testing process raises prac- 
tical, legal, ethical, and measurement issues that de- 
serve thoughtful review. 

In general, skeptics do not attack the practice of 
computerizing the mechanics of test administration 
and scoring; these computer applications are seen 
as efficient and appropriate uses of modern tech- 
nology. Nonetheless, even the most ardent propo- 
nents acknowledge the need to investigate test-form 
equivalency when an existing test is adapted to 
computerized administration. In particular, practi- 
tioners should not assume that the computerized 
adaptation and the original version of a test produce 
identical results. Equivalency is an empirical issue 
that must be demonstrated by appropriate research. 
For most tests, equivalency can be demonstrated, 


but this must not be taken for granted (Lukin, 
Dowd, Plake, & Kraft, 1985; Schuldberg, 1988). 

Some tests do not maintain score equivalency 
when translated to computer. The Category Test 
(CT) from the Halstead-Reitan Neuropsychologi- 
cal Battery is a case in point. In a comparison of 
computerized and standard versions of the Cate- 
gory Test with rehabilitation patients, Berger, Chib- 
nall, and Gfeller (1994) found a huge difference in 
error rate for two groups of subjects who had equiv- 
alent backgrounds: an average of 84 errors on the 
computerized CT versus an average of 66 errors on 
the standard CT test. Apparently, the computerized 
CT test is much more difficult than the standard 
version, which means that separate norms must be 
developed for its interpretation. Much smaller dif- 
ferences between computerized and standard test 
administration have also been reported for the 
MMPI, with computer-based scores tending to un- 
derestimate (very slightly) the booklet-based scores 
(Watson, Thomas, & Anderson, 1992). 
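
As a rough illustration of how a difference between test forms might be quantified, the sketch below computes the mean difference and a standardized effect size (Cohen's d with a pooled standard deviation) for two independent groups. Only the two group means (84 and 66) come from the text. The individual scores are invented for illustration, so the resulting effect size says nothing about the actual study. The function and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference for two independent groups (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Invented error counts, chosen only so that the group means are 84 and 66.
computerized_form_errors = [80, 90, 85, 75, 90]
standard_form_errors = [60, 70, 65, 75, 60]

mean_difference = mean(computerized_form_errors) - mean(standard_form_errors)
effect_size = cohens_d(computerized_form_errors, standard_form_errors)
print(f"mean difference = {mean_difference:.1f}, d = {effect_size:.2f}")
```

A real equivalency study would also report the reliability of each form and the correlation between forms. A mean difference alone shows only that separate norms may be needed.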

The main focus of controversy in computer- 
based test interpretation is computerized report 
writing. Several prominent experts in psychologi- 
cal testing have expressed grave reservations about 
the routine practice of automating the narrative re- 
ports that must accompany any clinical assessment 
(Faust & Ziskin, 1989; Lanyon, 1984; Matarazzo, 
1986, 1990; McMinn, Ellens, & Soref, 1999). The 
primary concerns include the following: 


• Computerized psychological testing is a poor
substitute for psychological assessment.

• Computerized narrative reports are rarely
validated prior to use.

• Computerized clinical psychological
interpretations are unsigned.


These points are elaborated in the following
paragraphs.

Matarazzo (1986) has emphasized that a 
computer-based evaluation is so easy to accomplish 
and so impressive in appearance that both users 
and recipients of a computerized narrative report may 
confuse it with a comprehensive assessment. But 
there is a difference between testing and assessment. 
The experienced clinician knows that psychological 
testing—whether traditional or computerized—is 
just the first step in a comprehensive assessment. In
performing a comprehensive assessment, the com- 
petent clinician goes beyond the test results to inte- 
grate the findings into the examinee’s total life 
situation and psychological history. In contrast, a 
computerized narrative report rarely makes refer- 
ence to nontest information such as the purpose of 
the assessment, the client’s recent adaptive func- 
tioning, or interview impressions that might strongly 
contradict test-based inferences about personality. 

In his critique of computer-based testing, 
Lanyon (1984) has emphasized that the printed in- 
terpretations lack demonstrated validity. Indeed, 
most automated test interpretations are based upon 
clinical lore rather than empirical validation—that 
is, they are clinical rather than actuarial in nature. 
Although vendors may conduct customer satisfaction
studies (e.g., “How satisfied are you with narrative
statement X on client Y?”), these analyses are
no substitute for careful scrutiny of interpretive va- 
lidity. Lanyon (1984) also notes with alarm that 
commercially available automated test interpreta- 
tion systems are growing exponentially. Some sys- 
tems can be purchased by almost anyone for home 
installation on a microcomputer. These develop- 
ments portend a dark future: 


There is a real danger that the few satisfactory ser- 
vices will be squeezed out by the many unsatisfac- 
tory ones, since the consumer professionals are 
generally unable to discriminate among them and 
are predisposed to believe whatever is printed. Par- 
ticularly distressing is that the lack of demonstrated 

program validity has now become the norm, and
there appear to be no checks against the further de- 
velopment of this untenable situation. Perhaps the 
time has now come when federal regulations for 
this industry are necessary for consumer protec- 
tion. (Lanyon, 1984) 


Matarazzo (1986) sounds a similar note, warning 
that the profession of psychology must regulate it- 
self or risk federal incursion into the practice of 
psychological testing. 

Two decades after Lanyon (1984) sounded an 
early alarm, the validity of computer-based test 
interpretations is still an ongoing concern. In fact, 
when computer-authored assessments are compared 


with clinician-authored assessments, the latter often 
emerge as superior in their clinical usefulness. Con- 
sider a recent study by Epstein and Rotunda (2000) 
in which a large sample of practicing psychologists 
viewed either a computer-based test report or a clin- 
ician-authored report, both derived from MMPI-2 
results. The psychologists were asked to assess the 
patients (in terms of symptom ratings) based upon 
the information received. These judgments were 
then compared with the symptom ratings made by 
staff where the patients were hospitalized. The re- 
sults indicated that the practicing psychologists 
who received clinician-authored reports were bet- 
ter able to match hospital staff ratings of patient 
symptomatology than psychologists who received 
computer-based reports. After reviewing studies on 
this topic, Butcher et al. (2002) reach this conclusion:
“Research thus far appears to indicate that
computer-generated reports should be viewed as 
valuable adjuncts to, rather than substitutes for, 
clinical judgment. Additional studies are needed to 
support broadened computer-based test usage” 
(p. 6). McMinn et al. (1999) surveyed 364 members
of the Society for Personality Assessment and
determined that, in fact, most members use com- 
puter-based test reports as a complementary source 
of input for case formulations; few members use 
computer-based testing as the primary way to for- 
mulate a case, and rarely is it used as an alternative 
to a written report. 

Another problem with computerized testing is 
that automated narrative reports are rarely signed. 
Unsigned reports raise horrendous issues with 
regard to professional responsibility and legal cul- 
pability. Suppose a client is harmed by an inaccu- 
rate computerized report. Who is to blame? Who 
is legally responsible? Fault could rest with the 
psychologist who used the computer program, the 
for-profit company that sold it, or the individual 
programmer who incorporated the offending state- 
ment(s) into the software logic. But legal re- 
sponsibility usually settles upon the individual 
psychologist. The lesson is that using an automated 
system does not absolve the practitioner of respon- 
sibility for the consequences of a computerized re- 
port (Case Exhibit 15.1). 

In the mid-1980s, the American Psychological
Association adopted Guidelines for Computer-Based
Tests and Interpretations (1986). These
guidelines are reprinted in Butcher (1987, App. B).
The general purpose of the guidelines is to interpret
the Standards for Educational and Psychological
Testing (AERA, APA, NCME, 1985) as
they relate to computer-based testing and test
interpretation. Two important guidelines include
the following:


1. The extent to which statements in an interpretive 
report are based on quantitative research versus 
expert clinical opinion should be delineated. 

2. When statements in an interpretive report are 
based on expert clinical opinion, users should be 
provided with information that will allow them 
to weigh the credibility of such opinion (Butcher, 
1987). 


The guidelines also detail the many other require- 
ments that test producers, test publishers, and test 
users should meet. Perhaps these steps toward self- 
regulation will help increase the respectability and 
acceptability of computerized report writing. If not, 
increased federal incursion into the practice of psy- 
chological testing would appear to be inevitable. 


COMPUTERIZED ADAPTIVE TESTING


A final advantage of computer-based testing is its 
application to flexible adaptive testing. Adaptive 
testing is nothing new—Binet used it when he 
worked out the methods for finding the basal and 
ceiling items on his famous intelligence test. Binet 
placed his items along a continuum of difficulty so 
that the examiner could test downward to find the 
examinee’s basal level and test upward to find the 
ceiling level. This procedure eliminated the need to 
administer irrelevant items—those so easy (below 
the basal level) that the examinee would surely pass 
them, or those so hard (above the ceiling level) that 
the examinee would surely fail them. Another ex- 
ample of adaptive testing is the two-stage proce- 
dure whereby results on an initial routing test are 
used to determine the entry level for subsequent 
scales. For example, on the Stanford-Binet: Fifth 
Edition, results of the initial vocabulary and matri- 
ces subtests determine the starting points for sub- 
sequent subtests. By reducing the time needed to 
obtain an accurate measure of ability, adaptive test- 
ing fulfills a very constructive purpose. 

Computerized adaptive testing (CAT) is a
family of procedures that allows for accurate and 
efficient measurement of ability (Wainer, 2002). 


Although details differ from one method to another, 
most forms of computerized adaptive testing share 
the following features: 


1. Based on extensive pretesting, the item response 
characteristics of each item (e.g., percentage 
passing versus ability) are appraised precisely. 

2. These item response characteristics and a CAT 
item-selection strategy are programmed into the 
computer. 

3. In selecting the next item for presentation, the 
computer uses the examinee’s total history of re- 
sponses up to that point. 

4. The computer recalculates the examinee’s esti- 
mated ability level after each response. 

5. The computer also estimates the precision of 
measurement (e.g., standard error of measure- 
ment) after each response. 

6. Testing continues until a predetermined level of 
measurement precision is reached. 

7. The examinee’s score is based on the difficulty 
level and other measurement characteristics of 
items passed, not on the total number of items 
correct. 
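
The seven features above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular testing program. It assumes a one-parameter (Rasch) item response model, a standard normal prior on ability, posterior-mean scoring on a grid, and a maximum-information rule for choosing the next item. The function names, the item bank, and the stopping threshold are all hypothetical.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate ability values, -4.0 to 4.0


def p_correct(theta, b):
    """Rasch probability that ability theta passes an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate_ability(responses):
    """Posterior mean and posterior SD given (difficulty, passed) pairs."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized standard normal prior
        for b, passed in responses:
            p = p_correct(theta, b)
            w *= p if passed else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    est = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - est) ** 2 * w for t, w in zip(GRID, weights)) / total
    return est, math.sqrt(var)


def run_cat(bank, answer, target_se=0.5):
    """Give items adaptively until the SE target is met or the bank is used up."""
    remaining = list(bank)
    responses = []
    theta, se = estimate_ability(responses)
    while remaining and se > target_se:  # feature 6: stop at the precision target
        # Feature 3: choose the unused item most informative at the current estimate.
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer(b)))
        # Features 4 and 5: update the ability estimate and its precision.
        theta, se = estimate_ability(responses)
    return theta, se, responses


# Hypothetical 25-item bank with difficulties from -3.0 to 3.0.
bank = [d / 4.0 for d in range(-12, 13)]

# Hypothetical examinee who passes every item easier than 1.0.
theta, se, responses = run_cat(bank, lambda b: b < 1.0, target_se=0.5)
print(f"items given: {len(responses)}, ability: {theta:.2f}, SE: {se:.2f}")
```

The final score is the ability estimate itself, which depends on the difficulty of the items passed and failed rather than on a simple count of correct answers (feature 7). Operational systems typically use richer item models and add constraints such as content balancing and item exposure control, which this sketch omits.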


The measurement advantages of CAT can be 
summarized in two words: precision and efficiency 
(Weiss & Vale, 1987). Regarding precision, CAT 
guarantees that each examinee is measured with the 
same degree of precision because testing continues 
until this criterion is met. This is not so with tradi- 
tional tests in which scores at both tails of the dis- 
tribution reflect greater levels of measurement error 
than scores in the middle of the distribution. Re- 
garding efficiency, the CAT approach requires far 
fewer test items than are needed in traditional test- 
ing. For example, written certification examinations 
usually include 200 to 500 items, while CAT exam- 
inations are always shorter, often including fewer 
than 100 items to achieve a more accurate level of 
measurement (Lunz & Bergstrom, 1994). In one 
analysis, the reliability of alternative computer- 
adaptive tests for certification in medical technology 
was .96 (Lunz, Bergstrom, & Wright, 1994). This is 
remarkable because shorter tests (the goal in CAT 
testing) tend to have lower reliability than longer 
tests (such as found in traditional testing programs). 
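
The usual relationship between test length and reliability is described by the Spearman-Brown formula, which the text does not state but which makes the point concrete. The sketch below assumes that the items added or removed are parallel to the existing ones. The reliability value passed in is an arbitrary example, not a figure from the studies cited, and the function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Halving a fixed-form test whose reliability is .90 (arbitrary example value).
halved = spearman_brown(0.90, 0.5)
print(round(halved, 3))  # prints 0.818
```

A fixed-form test cut in half would be expected to lose reliability in this way. An adaptive test can avoid much of that loss because it administers only items that are informative for the examinee at hand.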

In addition to increased measurement efficiency,
CAT has many other advantages over traditional 
paper-and-pencil assessment (Wainer, 2000, p. 11): 


• Test security is improved.

• Examinees work at their own pace.

• Examinees are equally challenged.

• Answer sheets pose no ambiguity (e.g., erasures).

• Immediate scoring and feedback are possible.

• Pretesting of new items can be included.

• Faulty items can be eliminated immediately.

• A variety of question types can be included.


Regarding the last point, examples of novel item 
types not possible on a traditional multiple-choice 
exam include spoken words (such as for a spelling 
test), open-ended math problems (the answer is 
typed in), and video segments (followed by written 
questions). 

The CAT approach to psychological testing has 
been used mainly by large organizations such as the 
U.S. Army and the Educational Testing Service for 
assessment of intelligence and special abilities. In 
recent years, national licensing boards (e.g., in 
medicine) have begun to implement CAT testing 
because of convenience in scheduling tests, tighter 
control over test security, reduced costs, and the op- 
portunity for better data collection (Lunz & 
Bergstrom, 1994). Technical information on CAT 
systems is proprietary and difficult to obtain. 
Nonetheless, it is clear that the efficiency of the 
CAT approach is substantial. CAT uses fewer items 
of better quality than a conventional test of the 
same length. A general finding is that CAT reduces 
test length by about 50 percent, with reductions for 
individual examinees of up to 80 percent, with no 
loss in measurement accuracy (Laatsch & Choca, 
1994; Weiss & Vale, 1987). 

As the cost of computing continues to plummet, 
more and more large-scale applications of CAT will 
be developed. In the late 1990s, the Educational 
Testing Service moved toward near total reliance 
on CAT versions of the Graduate Record Exami- 
nation and other selection tests. Licensing and cer- 
tification boards such as the National Council of 
State Boards of Nursing also have introduced CAT 
versions of their certification tests. Mills and Stock- 


ing (1996) discuss practical issues in large-scale 
computerized testing. 


THE FUTURE OF TESTING


As the twenty-first century begins, what is the fu- 
ture of psychological testing? We will hazard a few 
speculations here, cognizant that prognostications 
about the future often are wrong. Forecasting de- 
velopments in testing is especially difficult because 
the enterprise is increasingly constrained, directly 
or indirectly, by public opinion. For example, at one 
point in the 1980s the legislature of the state of Cal- 
ifornia made it illegal for school psychologists to 
use traditional intelligence tests as a basis for plac- 
ing students in special education classes. These re- 
straints on testing were driven by public outrage 
over the excessive placement of minority students 
in special education classes. Thus, even when a par- 
ticular technology of testing is feasible and pro- 
moted by psychologists, there exists the possibility 
that it might be strictly controlled or even banned. 

A case in point is Matarazzo’s (1992) predic- 
tion that biological measures of intelligence will 
gain prominence in the twenty-first century. Cer- 
tainly it appears true that biological measures of 
ability such as averaged evoked potential (gauged 
from EEG waves), or glucose metabolic rate in the 
brain (gauged from PET scans), or relative brain 
size (gauged from MRI scans) will prove to be ef- 
fective approaches to assessment (see Topic 5A, 
Theories and the Measurement of Intelligence). But 
Matarazzo (1992) goes further in asserting that 
these and other biological approaches actually will 
receive common usage: 


Therefore, another of my predictions is that in the 
early decades of the 21st century we may see the 
further development and use in practice of these 
and other biological indices of brain function and 
structure in a test (or a test battery) for the mea- 
surement of individual differences in mental abil- 
ity, thus heralding the first clear break from test 
items and tests in the Binet tradition in a century. 
(p. 1012, italics in the original) 


While Matarazzo’s prediction could come true, a 
more likely scenario is that the general public will 
be threatened when biological indices are used in
assessment and will therefore take steps (e.g., pres- 
sure on legislators) to ensure that such measures re- 
ceive limited (if any) application. The public will 
be threatened because, rightly or wrongly, biologi- 
cal characteristics such as glucose metabolic rate in 
the brain are perceived to be relatively permanent 
and immutable. The fear will arise that biological 
tests will sort people into a caste system. Even if 
(or when) the validity of biological tests is firmly 
established, it will be decades (if ever) until they 
are found acceptable by the general public. 

The computerization of testing, on the other 
hand, is already a fixture of industrialized societies 
and this trend can only increase in the future. Ex- 
isting tests will be adapted to the desktop computer 
with increasing regularity. An example of this trend 
is Fepsy (Ferrum + Psyche), a system for auto- 
mated neuropsychological testing that is available 
online at 220 sites throughout the Netherlands and 
most of Europe. Fepsy is described on the Internet 
at www.euronet.nl/users/fepsy. Fepsy consists of 
the following subtests: 


• Auditory reaction time

• Binary choice reaction time

• Tapping task

• Visual searching task

• Recognition tasks

• Vigilance task

• Rhythm task

• Classification task

• Visual half-field tasks

• Corsi block tapping


A common use is pre- and postoperative testing of 
patients who undergo epilepsy surgery for relief 
of seizures. The system has even been used with 
fully conscious patients during surgery. Under local 
anesthesia, the patient works on a subtest while si- 


multaneously receiving harmless electrical stimu- 
lation at distinctive sites on the cortex. The purpose 
is to determine whether specific cognitive functions 
might be affected when scar tissue is excised from 
the brain. The advantage of using a multicenter, 
multinational, computerized testing system is that 
the examiner has access to normative data for thou- 
sands of patients with specific conditions. 

Another prediction is that fewer and fewer 
wide-spectrum tests (e.g., personality inventories 
and individual intelligence tests) will be released 
by test publishers (Gregory, 1998). Instead, pub- 
lishers will concentrate on tests designed to assess 
particular areas of functioning for specific target 
populations (e.g., measures of memory functioning 
for elderly persons suspected of having dementia). 
The reasons for these complementary trends are 
economic: 


Test publishing is big business, a respectable way 
for large corporations to earn a profit. Publishers 
will be reluctant to make the major investment 
needed to develop new instruments that have the 
grandiose ambition of assessing many aspects of 
personality or intellect for a wide range of subjects. 
The cost is too high and—in light of the existing 
competition—the risk is too great. (Gregory, 1998, 
pp. 76-77) 


Test publishers likely will focus on less-expensive 
and less-risky forms of test development such as in- 
struments that embody distinctive constructs rele- 
vant to specific target groups. Examples might 
include tests to measure risky behaviors in adoles- 
cents, mental decline in elderly persons, faulty cog- 
nitions in depressed persons, or communication 
problems in maritally distressed couples. These 
kinds of instruments will flourish, whereas pub- 
lishers will rarely invest in new omnibus tests of 
personality or ability, preferring instead to revise 
and recycle existing instruments. 


SUMMARY 


1. The term computer-assisted psychological 
assessment (CAPA) refers to the entire range of 
computer applications in psychological assess- 


ment. This includes administration, scoring, and in- 
terpretation of tests; computerized adaptive testing; 
and sophisticated multimedia applications. 


2. The first use of computers to provide test
interpretations can be traced to the Mayo Clinic in 
the early 1960s. This MMPI interpretive system 
supplied brief scale-by-scale statements based 
upon clinical lore. 


3. Computer-based test interpretation (CBTI) 
is now available for virtually every published 
psychological test. Four approaches to CBTI are 
recognized: scoring reports, descriptive reports, 
actuarial reports, and computer-assisted clinical 
reports. 


4. Scoring reports consist of only scores 
and/or profiles, but may include statistical signifi- 
cance tests and confidence intervals plotted for the 
test scores. These reports highlight meaningful 
scores and score differences at a glance. 


5. A descriptive report provides brief scale- 
by-scale interpretation of test results. These reports 
are especially useful when test findings are con- 
veyed to mental health professionals who have lit- 
tle knowledge of the test in question. 


6. In actuarial test interpretation, an empiri- 
cally derived formula is used to diagnose, classify, 
or predict behavior. This is in contrast to the clini- 
cal approach in which the psychologist processes 
information in his or her head to diagnose, classify, 
or predict behavior. 


7. Empirical comparisons of clinical versus 
actuarial test interpretation find the latter to be su- 
perior in virtually every case. Computerized test in- 
terpretations should incorporate actuarial methods, 
when possible. 


8. In a computer-assisted clinical report, the 
interpretive statements are based upon the auto- 
mated and computerized judgment of one or more 
expert clinicians. This approach allows for inter- 
pretation of all test profiles, not just those that fit 
certain actuarial patterns. 


9. The advantages of computer-based test in- 
terpretation include objectivity, speed, and low 
cost. A major disadvantage is the danger that the 
psychologist could be excluded from the assess- 
ment process entirely, which increases the risk that 
test results will be misused. 


10. Multimedia includes realistic, interactive 
presentation of test stimuli via computer (e.g., 
video display of a work situation). Multimedia al- 
lows for the testing of complex, real-life problems, 
such as conflict resolution in the workplace. 


11. Computerized adaptive testing (CAT) is a 
family of procedures that allows for accurate and 
efficient measurement of ability. In this approach, 
the computer guides item selection based upon 
prior examinee answers. 


12. The purpose of CAT is to reach a predeter- 
mined level of measurement accuracy with as few 
test items as possible. A typical finding is that CAT 
reduces test length by about 50 percent with no loss 
in measurement accuracy. 


13. The future of testing is difficult to predict. 
Whereas some authorities predict an increase in use 
of biological measures of intelligence, this is un- 
certain. Certainly the increased computerization of 
testing is one clear trend that can only intensify. 


Testing of Cultural and Linguistic Minorities

Reprise: Responsible Test Use 


Summary 
Key Terms and Concepts 


The general theme of this book is that psychological
testing is a beneficial influence in
modern society. When used ethically and responsi- 
bly, testing provides a basis for arriving at sensible 
inferences about individuals and groups. After all, 
the intention of the enterprise is to promote proper 
guidance, effective treatment, accurate evaluation, 
and fair decision making—whether in one-on-one 
clinic testing or institutional group testing. Who 
could possibly complain about these goals? 

Thankfully, tests generally are applied in an ethi- 
cal and responsible manner by psychologists, edu- 
cators, administrators, and others. But there are ex- 
ceptions. Almost everyone has heard the horrific 
anecdotes: the minority grade schooler casually la- 
beled as having mental retardation on the basis of a 
single IQ score; the college student implausibly diag- 
nosed as schizophrenic from a projective test; the job 
applicant wrongfully screened from employment 
based upon an irrelevant measure; the aspiring teacher 
given unfair advantage when a competency test is 
mysteriously leaked beforehand; or the minority child 
penalized in testing because English is not her first 
language. Exceptions such as these illustrate the need 
for ethical and professional standards in testing. 

A major purpose of this topic is to introduce 
the reader to the ethical and professional standards 
that inform the practice of psychological testing. 




We also pursue the related theme of special con- 
siderations in the testing of cultural and linguistic 
minorities. The two topics share substantial over- 
lap: When an examinee is not from the majority 
Anglo-American culture (predominantly Caucasian,
English-speaking, individualistic, future-oriented), 
ethical and professional concerns in testing rise to 


the forefront.


THE RATIONALE FOR PROFESSIONAL
TESTING STANDARDS 


Testing is generally applied in a responsible man- 
ner, but as previously noted, there are exceptions. 
On rare occasion, testing is irresponsible by design 
rather than by accident. Consider, with shuddering 
amazement, the advertisement for Mind Prober fea- 
tured in a pop psychology magazine: 


Read Any Good Minds Lately? With the Mind 
Prober you can. In just minutes you can have a sci- 
entifically accurate personality profile of anyone. 
This new expert systems software lets you discover 
the things most people are afraid to tell you. The 
strengths, weaknesses, sexual interests and more. 
(Eyde & Primoff, 1992)


In this case the irresponsibility is so blatant that dis- 
cussion of ethical and professional guidelines is al- 
most superfluous. 

However, testing practices do not always present
in sharply contrasting shades, responsible or
irresponsible. The real challenge of competent as- 
sessment is to determine the boundaries of ethical 
and professional practice. As usual, it is the border- 
line cases that provide pause for thought. The reader 
is encouraged to read the quandaries of testing de- 
scribed in Case Exhibit 15.2 and form an opinion 
about each. These examples are based upon firsthand
reports to the author. At the close of this chapter,
we will return to these problematic vignettes.
The dilemmas of psychological testing do 
not always have simple, obvious answers. Even 
thoughtful and experienced psychologists may 
disagree as to what is ethical or professional in a 
given instance. Nonetheless, the scope of ethical 
and professional practice is not a matter of individ- 
ual taste or personal judgment. Responsible test use 
is defined by written guidelines published by professional
associations such as the American Psychological
Association, the American Counseling
Association, the National Association of School 
Psychologists, and other groups. Whether they 
know it or not, all practitioners owe allegiance to 
these guidelines, which we review in the following 
sections.

In general, the evolution of professional and 
ethical standards has been almost uniformly re- 
strictive, providing an ever-narrowing demarcation 
of where, when, and how psychological tests 
may be used. Writing from a legal background, 
Bersoff (1984) summarizes the historical trend as 
follows: 


At one time, the work of academic and applied 
psychometricians went virtually unexamined by 
the law, but as the use of tests increased in the 
United States, so did their potential for causing 
legally cognizable injury to test takers. As a result, 
there is probably no current activity performed by 
psychologists so closely scrutinized and regulated 
by the legal system as testing. 


Partly in response to the modern climate of litiga- 
tion, organizations concerned with psychological 
testing have published guidelines that collectively 
define the ethical and professional standards rele- 
vant to the practice of assessment. 

These standards also pertain to corporations 
and individuals who publish tests. We begin with a 
survey of guidelines for test publishers before ex- 
amining the responsibilities of test users. The chap- 
ter closes with a review of special concerns in the 
testing of cultural and linguistic minorities. 


RESPONSIBILITIES OF 
TEST PUBLISHERS 


The responsibilities of publishers pertain to the 
publication, marketing, and distribution of their 
tests. In particular, it is expected that publishers will 
release tests of high quality, market their product in 
a responsible manner, and restrict distribution of 
tests only to persons with proper qualifications. We 
consider each of these points in turn. 


Publication and Marketing Issues 


Regarding the publication of new or revised instru- 
ments, the most important guideline is to guard 
against premature release of a test. Testing is a 
noble enterprise but it is also big business driven 
by the profit motive, which provides an inherent 
pressure toward early release of new or revised ma- 
terials. Perhaps this is why the American Psycho- 
logical Association and other organizations have 
published standards that relate to test publication 
(AERA/APA/NCME, 1985, 1999). These stan- 
dards pertain especially to the technical manuals 
and user guides that typically accompany a test. 
These sources must be sufficiently complete so that 
a qualified user or reviewer can evaluate the appro- 
priateness and technical adequacy of the test. This 
means that manuals and guides will report detailed 
statistics on reliability analyses, validity studies, 
normative samples, and other technical aspects. 

Marketing tests in a responsible manner refers 
not only to advertising (which should be accurate 
and dignified) but also to the way in which infor- 
mation is portrayed in manuals and guides. In par- 
ticular, test authors should strive for a balanced 
presentation of their instruments and refrain from a 
one-sided presentation of information. For exam- 
ple, if some preliminary studies reflect poorly on a 
test, these should be given fair weight in the man- 
ual alongside positive findings. Likewise, if a po- 
tential misuse or inappropriate use of a test can be 
anticipated, the test author needs to discuss this 
matter as well. 


Competence of Test Purchasers 


Test publishers recognize the broad responsibility 
that only qualified users should be able to purchase 
their products. By way of brief review (see Topic 
2A, The Nature and Uses of Psychological Tests),
the reasons for restricted access include the poten- 
tial for harm if tests fall into the wrong hands (e.g., 
an undergraduate psychology major administers the 
MMPI-2 to his friends and then makes frightful pro- 
nouncements about the results) and the obvious fact 
that many tests are no longer valid if potential examinees
have previewed them (e.g., a teacher memorizes
the correct answers to a certification exam).

These examples illustrate that access to psy- 
chological tests needs to be limited. But limited to 
whom? The answer, it turns out, depends upon the 
complexity of the specific test under consideration. 
Guidelines proposed many years ago by the Amer- 
ican Psychological Association (APA) are still rel- 
evant today, even though they are not enforced by 
all publishers. The APA proposed that tests fall into 
three levels of complexity (Levels A, B, and C) 
that require different degrees of expertise from the 
examiner. 


Level A: These instruments are straightforward
paper-and-pencil measures that can be ad- 
ministered, scored, and interpreted with min- 
imal training. With the aid of a manual, these 
tests can be used by responsible nonpsy- 
chologists such as business executives or ed- 
ucational administrators. This category 
includes vocational proficiency and group 
educational achievement tests. 

Level B: These tests require knowledge of test 
construction and training in statistics and 
psychology. These products are available to 
persons who have completed an advanced- 
level course in testing from an accredited 
college or university, or equivalent training 
under the supervision of a qualified psychol- 
ogist. This category includes aptitude tests 
and personality inventories applicable to nor- 
mal populations. 

Level C: These tests require substantial under- 
standing of testing and supporting topics. Su- 
pervised experience is essential for the proper 
administration, scoring, and interpretation of 
these instruments. Typically, Level C tests are 
available only to persons with a minimum of 
a master’s degree in psychology or an allied 
field. These instruments include individual 
tests of intelligence, projective personality 
tests, and neuropsychological test batteries 
(American Psychological Association, 1953). 


In general, test publishers try to screen out inappropriate
requests by requiring that purchasers
have the necessary credentials. For example, the
Psychological Corporation, one of the major sup- 
pliers of test materials in the United States, requires 
prospective customers to fill out a registration form 
detailing their training and experience with tests. 
Buyers who do not hold an advanced degree in psy- 
chology must list details of courses in the adminis- 
tration and interpretation of tests and in statistics. 
References are required, too. 

Most test publishers also specify that individuals 
or groups who provide testing and counseling by 
mail are not allowed to purchase materials. On a 
related note, ethical standards now discourage 
practitioners from giving “take-home” tests to 
clients. Until recent years, this has been an occa- 
sional practice with lengthy personality tests such as 
the MMPI. The ethics committee endorsed the 
following points: 


1. Nonmonitored administration of the MMPI gen- 
erally does not represent sound testing practice 
and may result in invalid assessment for a vari- 
ety of reasons (e.g., influence from other people 
or completion of the test while intoxicated). 

2. Test security cannot be guaranteed when the 
MMPI is allowed outside the clinical setting.

3. There is debate as to whether there are ever any 
circumstances in which it might be reasonable 
and appropriate to allow an MMPI to be com- 
pleted away from the clinical setting. 

4. These issues are not unique to the MMPI, but 
must be considered in conducting any assess- 
ment. 

5. In judging the ethicality of at-home administra- 
tion of tests, it is important to consider such 
things as the nature and purpose of the test and 
available information regarding reliability, va- 
lidity, and standardization procedures (APA, 
1994b, pp. 665–666).


In general, users are advised to refrain from giving 
take-home tests and publishers are counseled to 
deny access to practitioners or groups who promote 
this practice. 

Even though publishers attempt to filter out 
unqualified purchasers, there may still be instances 
in which sensitive tests are sold to unscrupulous 
individuals. Oles and Davis (1977) discovered that
graduate students in psychology could purchase the 
WISC-R, MMPI, TAT, Stanford-Binet, and 16PF if 
they typed their orders on college stationery, placed 
the letters Ph.D. after their names, enclosed pay- 
ment, and used a post office box return address. Al- 
though illicit test orders are few in number, they do 
occur. 





RESPONSIBILITIES OF TEST USERS


The psychological assessment of personality, inter- 
ests, brain functioning, aptitude, or intelligence is 
a sensitive professional action that should be com- 
pleted with utmost concern for the well-being of 
the examinee, his or her family, employers, and the 
wider network of social institutions that might be 
affected by the results of that particular clinical as- 
sessment (Matarazzo, 1986, 1990). Over the years, 
the profession of psychology has proposed, clari- 
fied, and sharpened a series of thorough and 
thoughtful standards to provide guidance for the 
individual practitioner. Professional organizations 
publish formal ethical principles that bear upon test 
use, including the American Psychological Associ- 
ation (APA, 1992a), the American Association for 
Counseling and Development (AACD, 1988), the 
American Speech-Language-Hearing Association 
(ASHA, 1991), and the National Association of 
School Psychologists (NASP, 1992). 

In addition to ethical principles, several testing 
organizations have published practice guidelines to 
help define the scope of responsible test use. 
Sources of test use guidelines include teaching 
groups (AFT, NCME, NEA, 1990), the American 
Psychological Association (APA, 1992b), the Edu- 
cational Testing Service (ETS, 1987, 1988, 1989), 
the Joint Committee on Testing Practices (JCTP, 
1988), the Society for Industrial and Organizational 
Psychology (SIOP, 1987), and professional al- 
liances (AERA, APA, NCME, 1985, 1999). Finally, 
we should mention that the principles of responsi- 
ble test use have been distilled in an illuminating 
casebook published jointly by several testing 
groups (Eyde, Robertson, Krug, and others, 1993). 


The dozens of guidelines relevant to testing are 
quite specific, for example: 


Standard 5.9: When test score information is re- 
leased to students, parents, legal representatives, 
teachers, clients, or the media, those responsible for 
testing programs should provide appropriate inter- 
pretations. The interpretations should describe in 
simple language what the test covers, what scores 
mean, the precision of the scores, common misinter- 
pretations of test scores, and how scores will be used. 


Because of their specificity, a detailed analysis of 
relevant ethical and professional standards is be- 
yond the scope of this text. What follows is a sum- 
mary of the general provisions that pertain to the 
responsible practice of psychological testing and 
clinical psychological assessment. 

These principles apply to psychologists, stu- 
dents of psychology, and others who work under 
the supervision of a psychologist. We restrict our 
discussion to those principles that are directly per- 
tinent to the practice of psychological testing. 
Proper adherence to these principles would elimi- 
nate most—but not all—legal challenges to testing. 


Best Interests of the Client 


Several ethical principles recognize that all psycho- 
logical services, including assessment, are provided 
within the context of a professional relationship. 
Psychologists are therefore enjoined to accept the 
responsibility implicit in this relationship. In gen- 
eral, the practitioner is guided by one overriding 
question: What is in the best interests of the client? 
The functional implication of this guideline is that 
assessment should serve a constructive purpose for 
the individual examinee. If it does not, the practi- 
tioner is probably violating one or more specific eth- 
ical principles. For example, Standard 11.15 in the 
Standards manual (AERA, APA, NCME, 1999) 
warns testers to avoid actions that have unintended 
negative consequences. Allowing a client to attach 
unsupported surplus meanings to test results would 
not be in the best interests of the client and would 
therefore constitute an unethical testing practice. In 
fact, with certain worry-prone and self-doubting 
clients, a psychologist may choose not to use an appropriate
test, since these clients are almost certain
to engage in self-destructive misinterpretation of 
virtually any test findings. 


Confidentiality and the Duty to Warn 


Practitioners have a primary obligation to safe- 
guard the confidentiality of information, including 
test results, that they obtain from clients in the 
course of consultations (Principle 5; APA, 1992a). 
Such information can be ethically released to oth- 
ers only after the client or a legal representative 
gives unambiguous consent, usually in written 
form. The only exceptions to confidentiality in- 
volve those unusual circumstances in which the 
withholding of information would present a clear 
danger to the client or other persons. For example, 
most states have passed laws that mandate that 
health care practitioners must report all cases of 
suspected abuse in children and vulnerable elderly 
persons. In most states, a psychologist who learns 
in the course of testing that the client has physically 
or sexually abused a child is obligated to report that 
information to law enforcement. 

Psychologists also have a duty to warn that 
stems from the 1976 decision in the Tarasoff case 
(Wrightsman, Nietzel, Fortune, & Greene, 2002). 
Tanya Tarasoff was a young college student in Cal- 
ifornia who was murdered by Prosenjit Poddar, a 
student from India. What makes the case relevant 
to the practice of psychology is that Poddar had 
made death threats regarding Tarasoff to his cam- 
pus-based therapist. Although the therapist warned 
the police that Poddar had made death threats, he 
did not warn Tarasoff. Two months later, Poddar 
stabbed Tarasoff to death at her home. The parents 
of Tanya Tarasoff sued, and the California Supreme 
Court later agreed that therapists have a duty to use 
“reasonable care” to protect potential victims from 
their clients. Although the Tarasoff ruling has been 
modified by legislation in many states, the thrust of 
the case still stands: Clinicians must communicate 
any serious threat to the potential victim, law en- 
forcement agencies, or both. 

Finally, the clinician should consider the client’s
welfare in deciding whether to release information,
especially when the client is a minor who is unable
to give voluntary, informed consent. When appro- 
priate, practitioners are advised to inform their 
clients of the legal limits of confidentiality. 


Expertise of the Test User 


A number of principles acknowledge that the test 
user must accept ultimate responsibility for the 
proper application of tests. From a practical stand- 
point, this means that the test user must be well 
trained in assessment and measurement theory. The 
user must possess the expertise needed to evaluate 
psychological tests for proper standardization, reli- 
ability, validity, interpretive accuracy, and other psy- 
chometric characteristics. This guideline has special 
significance in areas such as job screening, special 
education, testing of persons with disabilities, or 
other situations in which potential impact is strong. 

Psychologists who are poorly trained in their 
chosen instruments can make serious errors of test 
interpretation that harm examinees. Furthermore, 
inept test usage may expose the examiner to pro- 
fessional sanctions and civil lawsuits. A common 
error observed among inexperienced test users is 
the overzealous, pathologized interpretation of per- 
sonality test results (Case Exhibit 15.3). 

The expertise of the psychologist is particularly 
relevant when test scoring and interpretation ser- 
vices are used. The Ethical Principles of the Amer- 
ican Psychological Association leave no room for 
doubt: 


Psychologists retain appropriate responsibility for
the appropriate application, interpretation, and use
of assessment instruments, whether they score and 
interpret such tests themselves or use automated or 
other services. (APA, 1992a) 


The reader is referred to Topic 15A, Computerized 
Assessment and the Future of Testing, for further 
discussion of this point. 


Informed Consent 


Before testing commences, the test user needs to ob- 
tain informed consent from test takers or their legal 
representatives. Exceptions to informed consent can
be made in certain instances, for example, legally 
mandated statewide testing programs, school-based 
group testing, and when consent is clearly implied 
(e.g., college admissions testing). The principle of 
informed consent is so important that the Stan- 
dards manual devotes a separate standard to it: 


Informed consent implies that the test takers or 
representatives are made aware, in language that 
they can understand, of the reasons for testing, the 
type of tests to be used, the intended use and the 
range of material consequences of the intended use. 
If written, video, or audio records are made of the 
testing session, or other records are kept, test takers 
are entitled to know what testing information will 
be released and to whom. (AERA et al., 1999) 


Even young children or test takers with limited intelligence
deserve an explanation of the reasons for
assessment. For example, the examiner might explain,
“I’m going to ask you some questions and
have you work on some puzzles so I can see what 
you can do and find out what things you need more 
help with.” 

From a legal standpoint, the three elements of 
informed consent include disclosure, competency, 
and voluntariness (Melton, Petrila, Poythress, & 
Slobogin, 1998). The heart of disclosure is that the 
client receive sufficient information (e.g., about 
risks, benefits, release of reports) to make a 
thoughtful decision about continued participation 
in the testing. Competency refers to the mental ca- 
pacity of the examinee to provide consent. In gen- 
eral, there is a presumption of competency unless 
the examinee is a child, very elderly, or has mental 
disabilities (e.g., has mental retardation). In these 
cases, a guardian will need to provide legal consent. 
Finally, the standard of voluntariness implies that
the choice to undergo an assessment battery is 
given freely and not based on subtle coercion (e.g., 
inmates are promised release time if they partici- 
pate in research testing). In most cases, the exam- 
iner uses a written informed consent form such as 
that found in Figure 15.3. 


Obsolete Tests and the Standard of Care 


Standard of care is a loose concept that often arises 
in the professional or legal review of specific health 
practices, including psychological testing. The pre- 
vailing standard of care is one that is “usual, cus- 
tomary or reasonable” (Rinas & Clyne-Jackson, 
1988). To cite an extreme example, in medicine the
standard of care for a fever might include the administration
of aspirin but would not include the
antiquated practice of bleeding the patient.

Practitioners of psychological testing must be 
wary of obsolete tests, because their use might vi- 
olate the prevailing standard of care. A case in point 
is the MMPI versus the MMPI-2. Even though the 
MMPI-2 is a relatively conservative revision of 
the highly esteemed MMPI, the improvements in 
norming and scale construction are substantial. The 
MMPI-2 is now the standard of care in MMPI- 
based assessment of psychopathology. Practition- 
ers who continue to rely upon the original MMPI 
could be liable for malpractice suits, especially if 
the test interpretation resulted in misleading inter- 
pretive statements or an incorrect diagnosis. 





This is an agreement between [Client’s name] and [Practitioner’s Name], Ph.D., a
licensed psychologist in the state of Illinois. You are encouraged to ask questions about
experience or professional credentials at any time.

1. General Information: The purpose of this assessment is to provide your [physician,
counselor, therapist] with information about your psychological functioning that may
prove helpful in his/her work with you. The assessment will involve a brief interview
and psychological testing. This will take 3–4 hours of your time.

2. Test Report: The relevant information from the interview and the test results will be
summarized in a written report which will be sent to your [physician, counselor,
therapist]. The test results and the report will be reviewed with you in approximately one
week.

3. Confidentiality: The report will not be released to any other source unless you request
this formally in writing. Exceptions to this rule include these situations: your life or
another person’s life is in danger, child or elder abuse is reported, or a court orders
the disclosure of the report.

4. Cost: An hourly rate of $___ is used in arriving at the total fee. Some or all of this
cost may be covered by your health insurance policy. The estimated total cost for
your assessment is $__.

5. Side Effects: Although most individuals enjoy the process of psychological consultation,
some persons find it uncomfortable, especially if the test results indicate
psychological problems. It is appropriate for you to discuss your feelings with the
examiner. You are free to withdraw your consent for ongoing testing at any time.

6. Refusal of Assessment: You have the right to refuse this assessment. You are not
required to complete this evaluation in order to continue working with your [physician,
counselor, therapist]. However, your treatment is more likely to be effective if you
participate in this assessment. Upon request, I will discuss referral options with you.

Client’s Name

Date


FIGURE 15.3 Abbreviated Example of Informed Consent for Psychological Assessment
Source: Adapted and abbreviated with permission from Gregory, R. J. (1999).
Foundations of intellectual assessment: The WAIS-III and other tests in clinical
practice. Boston: Allyn and Bacon.
Another concern relevant to the standard of
care is reliance upon test results that are outdated 
for the current purpose. After all, individual char- 
acteristics and traits show valid change over time. 
A student who meets the criteria for learning dis- 
ability in the fourth grade might show large gains 
in academic achievement, such that the LD diag- 
nosis is no longer accurate in the fifth grade. Per- 
sonality test results are especially prone to quixotic 
change. A short-term personal crisis might cause an 
MMPI-2 profile to look like a range of mountains. 
A week later, the test profile could be completely 
normal. It is difficult to provide comprehensive 
guidelines as to the “shelf life” of psychological 
test results. For example, GRE test scores that are 
years old still might be validly predictive of per- 
formance in graduate school, whereas Beck De- 
pression Inventory test results from yesterday could 
mislead a therapist as to the current level of de- 
pression. Practitioners must evaluate the need for 
retesting on an individual basis. 


Responsible Report Writing 


Except for group testing, the practice of psycho- 
logical testing invariably culminates in a written re- 
port that constitutes a semipermanent record of test 
findings and examiner recommendations. Effective 
report writing is an important skill because of the 
potential lasting impact of the written document. It 
is beyond the scope of this text to illuminate the 
qualities of effective report writing, although we 
can refer the reader to a few sources (Gregory, 
1999; Tallent, 1993). 

Responsible reports typically use simple and 
direct writing that steers clear of jargon and tech- 
nical terms. The proper goal of a report is to pro- 
vide helpful perspectives on the client, not to 
impress the referral source that the examiner is a 
learned person! When Tallent (1993) surveyed 
more than one thousand health practitioners who 
made referrals for testing, one respondent declared 
his disdain toward psychologists who “reflect their 
needs to shine as a psychoanalytic beacon in re- 
vealing the dark, deep secrets they have observed.” 
On a related note, effective reports stay within the 
bounds of expertise of the examiner. For example: 


It is never appropriate for a psychologist to recom- 
mend that a client undergo a specific medical pro- 
cedure (such as a CT scan for an apparent brain 
tumor) or receive a particular drug (such as Prozac 
for depression). Even when the need for a special 
procedure seems obvious (e.g., the symptoms 
strongly attest to the rapid onset of a brain disease), 
the best way to meet the needs of the client is to 
recommend immediate consultation with the ap- 
propriate medical profession (e.g., neurology or 
psychiatry). (Gregory, 1999) 


Additional advice on effective report writing can be 
found in Ownby (1991) and Sattler (2001). 


Communication of Test Results 


Individuals who take psychological tests anticipate 
that the results will be shared with them. Yet prac- 
titioners often do not include one-to-one feedback 
as part of the assessment. A major reason for re- 
luctance is a lack of training in how to provide feed- 
back, especially when the test results appear to be 
negative. For example, how does a clinician tell a 
college student that her IQ is 93 when most stu- 
dents in that milieu score 115 or higher? 

Providing effective and constructive feedback 
to clients about their test results is a challenging 
skill to learn. Pope (1992) emphasizes the respon- 
sibility of the clinician to determine that the client 
has understood adequately and accurately the in- 
formation that the clinician was attempting to con- 
vey. Furthermore, it is the responsibility of the 
clinician to check for adverse reactions: 


Is the client exceptionally depressed by the find- 
ings? Is the client inferring from findings suggest- 
ing a learning disorder that the client—as the client 
has always suspected—is “stupid”? Using scrupu- 
lous care to conduct this assessment of the client’s 
understanding of and reactions to the feedback is 
no less important than using adequate care in ad- 
ministering standardized psychological tests; test 
administration and feedback are equally important, 
fundamental aspects of the assessment process. (p. 271)


Proper and effective feedback involves give-and- 
take dialogue in which the clinician ascertains how 
the client has perceived the information and seeks 
to correct potentially harmful interpretations. 
Destructive feedback often arises when the
clinician fails to challenge a client’s incorrect per- 
ceptions about the meaning of test results. Consider 
IQ tests in particular, a case in which many persons
deify test scores and consider them an index
of personal worth. Prior to providing test results, a 
clinician is advised to investigate the client’s un- 
derstanding of what IQ scores mean. After all, IQ 
is a limited slice of intellectual functioning: It does 
not evaluate drive or character of any kind, it is accurate
only to about ±5 points, it may change over
time, and it does not assess many important attrib- 
utes such as creativity, social intelligence, musical 
ability, or athletic skill. But a client may have an 
unrealistic perspective about IQ and hence might 
jump to erroneous conclusions when hearing that 
her score is “only” 93. The careful practitioner will 
elicit the client’s views and challenge them when 
needed before proceeding. Further thoughts on 
feedback can be found in Gass and Brown (1992) 
and Pope (1992). 
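
The statement that an IQ score is accurate only to about ±5 points can be illustrated with the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability). The following is a minimal sketch, not a procedure taken from any test manual: the standard deviation of 15 follows the usual IQ metric, while the reliability of .97 and the function names are assumptions chosen only for illustration. The band is centered on the observed score for simplicity.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(score: float, sd: float = 15.0,
                    reliability: float = 0.97, z: float = 1.96):
    """Approximate 95% band around an observed score.

    The default reliability (0.97) is an illustrative assumption,
    not a value reported for any particular test.
    """
    half_width = z * standard_error_of_measurement(sd, reliability)
    return score - half_width, score + half_width


low, high = confidence_band(93)
print(f"Observed IQ of 93: plausible range about {low:.0f} to {high:.0f}")
```

Under these assumed values the band extends roughly five points on either side of the observed score, which is one way a clinician might frame a score of 93 as a range rather than a fixed number.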

Going beyond the general pronouncement to 
avoid harm when providing test feedback, Finn and 
Tonsager (1997) present the intriguing view that in- 
formation about test results should be directly and 
immediately therapeutic to individuals experienc- 
ing psychological problems. In other words, they 
propose that psychological assessment is a form of 
short-term intervention, not just a basis for gather- 
ing information that is later used for therapeutic 
purposes. In one study (Finn & Tonsager, 1992), 
they examined the effects of a brief psychological 
assessment on clients at a university counseling 
center. Thirty-two students took part in an initial 
interview, completed the MMPI-2, and then re- 
ceived a one-hour feedback session conducted ac- 
cording to a method developed by Finn (1996). A 
comparison group of 29 students was interviewed 
and received an equal amount of supportive, nondi- 
rective psychotherapy instead of the test feedback. 
The clients in the MMPI-2 assessment group 
showed a greater decline in symptomatic distress 
and a greater increase in self-esteem, immediately 
following their feedback session and also two 
weeks later, than the clients in the comparison 
group. The feedback group also felt more hopeful 
about their problems after the brief assessment. 


These findings illustrate the importance of provid- 
ing thoughtful and constructive test feedback in- 
stead of rushing through a perfunctory review of 
the results. 


Consideration of Individual Differences 


Knowledge of and respect for individual differ- 
ences is highlighted by all professional organiza- 
tions that deal with psychological testing. The 
American Psychological Association lists this as 
one of six guiding principles:


Principle D: Respect for People’s Rights and Dignity
... Psychologists are aware of cultural, individual, 
and role differences, including those due to age, 
gender, race, ethnicity, national origin, religion, 
sexual orientation, disability, language, and socio- 
economic status. Psychologists try to eliminate the 
effect on their work of biases based on those fac- 
tors, and they do not knowingly participate in or 
condone unfair discriminatory practices. (APA, 
1992a) 


The relevance of this principle to psychological 
testing is that practitioners are expected to know 
when a test or interpretation may not be applicable 
because of factors such as age, gender, race, eth- 
nicity, national origin, religion, sexual orientation, 
disability, language, and socioeconomic status. We 
can illustrate this point with a case study reported 
in Eyde et al. (1993). A psychologist evaluated a 
75-year-old man at the request of his wife, who had 
noticed memory problems. The psychologist ad- 
ministered a mental status examination and a 
prominent intelligence test. Performance on the 
mental status examination was normal, but stan- 
dard scores on the intelligence test revealed a large 
discrepancy between verbal subtests and subtests 
measuring spatial ability and processing speed. The 
psychologist interpreted this pattern as indicating a 
deterioration of intellectual functioning in the hus- 
band. Unfortunately, this interpretation was based 
upon faulty use of non-age-corrected standard 
scores. Also, the psychologist did not assess for de- 
pression, which is known to cause visuospatial per- 
formance to drop sharply (Wolff & Gregory, 1992). 
In fact, a series of further evaluations revealed that 
the husband was a perfectly healthy 75-year-old 
man. The psychologist failed to consider the relevance
of the gentleman’s age and emotional status
when interpreting the intelligence test. This was a 
costly oversight that caused the client and his wife 
substantial unnecessary worry. 


TESTING OF CULTURAL AND 
LINGUISTIC MINORITIES 


Background and Historical Notes 


Persons of ethnic minority descent (non-European 
origin) currently constitute about 25 percent of the 
U.S. population, and it is estimated that they will 
comprise more than 50 percent within several 
decades (Dana, 1993). Yet the enterprise of testing 
is based almost entirely upon the efforts of white 
psychologists who bring an Anglo-American view- 
point to their work. The suitability of existing tests 
for the evaluation of diverse populations cannot be 
taken for granted. The assessment of ethnic minor- 
ity individuals raises important questions, espe- 
cially when test results translate to placement 
decisions or other sensitive outcomes, as is com- 
monly the case within educational institutions. 

As noted in Chapter 1 (The History of Psy- 
chological Testing), early pioneers in the testing 
movement largely ignored the impact of cul- 
tural background on test results. For example, in 
the 1920s Henry Goddard concluded that the in- 
telligence of the average immigrant was alarm- 
ingly low, “perhaps of moron grade.” Yet he 
downplayed the likelihood that language and cul- 
tural differences could explain the low test scores 
of immigrants. 

Perhaps as a rebound against these early meth- 
ods, beginning in the 1930s psychologists dis- 
played an increased sensitivity to cultural variables 
in the practice of testing. A shining example in this 
regard was Stanley Porteus, who undertook a wide- 
ranging investigation of the temperament and in- 
telligence of Australian aboriginal peoples. Porteus 
(1931) used many traditional instruments (block 
designs, mazes, digit span), but to his credit he also 
devised an ecologically valid measure of intelligence
for this group, namely, footprint recognition.
Whereas the aboriginal examinees performed 
poorly on the Eurocentric tests, their ability to rec- 
ognize photographed footprints was on a par with 
other racial groups studied. Even so, Porteus dis- 
played an acute awareness that his procedures still 
might have handicapped the aboriginals: 


The photograph of a footprint is not the same as the
footprint itself, and quite probably a number of 
cues that are made use of by the aboriginal tracker 
are absent from a photograph. The varying depths 
of parts of the foot impression are not visible in the 
photograph, and the individual peculiarities other 
than general shape and size of the footprint may 
not be brought out clearly. Hence we must expect 
that the aboriginal subjects would be under some 
disadvantage in matching these photographs of 
footprints, as against recognition of the footprints 
themselves. (pp. 399-400)


In a similar vein, DuBois (1939) found that Pueblo 
Indian children displayed superior ability on his 
specially devised horse drawing test of mental abil- 
ity, whereas they performed less well on the main- 
stream Goodenough (1926) Draw-A-Man test. 
From these early studies onward, psychologists 
have maintained a keen interest in the impact of lan- 
guage and culture on the meaning of test results. 


The Impact of Cultural 
Background on Test Results 


Practitioners need to appreciate that the cultural 
background of examinees will impact the en- 
tire process of assessment. For this reason, Sattler 
(1988) advises assessment psychologists to ap- 
proach their task from a pluralistic standpoint: 


Cultural groups may vary with respect to cultural 
values (stemming in part from cultural shock, dis- 
continuity, or conflict); language and nuances in 
language style; views of life and death; roles of 
family members; problem-solving strategies; atti- 
tudes toward education, mental health, and mental 
illness; and stage of acculturation (the group may 
follow traditional values, accept the dominant 
group’s values, or be at some point between the 
two). You should adopt a frame of reference that
will enable you to understand how particular be- 
haviors make sense within each culture. (p. 565) 


For example, it is often noted that Native Americans 
display a distinctive conception of time, emphasiz- 
ing present-time as opposed to the future-time ori- 
entation that is so powerfully formative in white, 
middle-class America (Paniagua, 1994). A possible
implication of this cultural difference is that time 
limits might not mean the same thing for a Native 
American child as for a child from the mainstream 
culture. Perhaps the minority child will disregard 
the subtest instructions and work at a careful, mea- 
sured pace rather than seeking quick solutions. Of 
course, this child would then obtain a misleadingly 
low score on that measure. 

While acknowledging the impact of cultural dif- 
ferences on testing, it is also important to avoid 
stereotypical overgeneralization. Culture is not 
monolithic. Every person is unique. Some Native 
Americans will exhibit a distinctive orientation to 
time but perhaps most will not. The challenge for the 
practitioner is to observe the clinical details of per- 
formance and to identify the culture-based nuances 
of behavior that help determine the test results. 

An ingenious study by Moore (1986) power- 
fully illustrates the relevance of cultural back- 
ground for understanding the test performance of 
ethnic minority examinees. She compared not only 
the intelligence test scores but also the qualitative 
manner of responding to test demands in two groups 
of adopted African American children. One group 
of 23 children had been transracially adopted into 
middle-class white families. The other group of 23 
children had been intraracially adopted into mid- 
dle-class African American families. All children 
were adopted prior to age 2 and the backgrounds of 
the adoptive families were similar in terms of edu- 
cation and social class. Thus, group difference in 
test scores and test behaviors could be attributed 
mainly to differences in cultural background aris- 
ing from the fact that one group was adopted into 
African American families, the other adopted into 
white families. Testing and observations were completed
by two female African American examiners
who were “blind” to the purposes of the study. Tested
at 7 to 10 years of age, the transracially adopted chil- 
dren scored an average IQ of 117 on the WISC 
compared to an average IQ of 104 for the tradi- 
tionally adopted children. These IQ results were not 
remarkable, insofar as Scarr and Weinberg reported 
similar findings years before. 

The surprising and informative outcome of the 
study was that the two groups of children showed 
very different qualitative behaviors during testing. 
As a group, the children with lower IQ scores 
(those adopted by African American families) were 
less likely to spontaneously elaborate on their work 
responses and more likely simply to refuse to re- 
spond when presented with a test demand. Moore 
(1986) offers the following interpretations: 


Children’s tendency to spontaneously elaborate
on their work responses may be a very important
index of their level of involvement in task performance,
strategies for problem solving, level of
motivation to generate a correct response, and level
of adjustment to the standardized test situation. . . .
Although the terminal not-work response is treated
as an incorrect response, it does not actually provide
any empirical documentation of what the
child does or does not know or of what the child
can and cannot do. The only information available
is that the child did not respond to the demand.
(p. 322)


The essential lesson of this study is that culturally 
based differences in response style may function to 
conceal the underlying competence of some exam- 
inees. Cautious interpretation of test results is al- 
ways advisable, but this is especially important for 
examinees from culturally or linguistically diverse 
backgrounds. 

The influence of cultural factors is not limited 
to the test performance of children, but extends to 
adults as well. Terrell, Terrell, and Taylor (1981) 
investigated the effects of racial trust/mistrust on 
the intelligence test scores of African American 
college students. They identified African American 
students with high and low levels of mistrust of 
whites. Using a 2 × 2 design, half of each group was
then administered an individual intelligence test
by a white examiner, the other half by an African
American examiner. As predicted, the analysis of 
variance revealed no differences for the main ef- 
fects of race of examiner (white versus African 
American) or level of mistrust (high versus low) 
(Figure 15.4). But a substantial interaction was re- 
vealed, namely, the high-mistrust group with an 
African American examiner scored much better 
than the high-mistrust group with a white examiner 
(average IQs of 96 versus 86, respectively). Put 
simply, cultural mistrust among African Americans 
was associated with significantly lower IQ scores, 
but only when the examiner was white. 
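
The logic of an interaction without main effects can be illustrated with simple arithmetic on cell means. The Python sketch below is a hedged illustration, not a reanalysis of the study: only the two high-mistrust means (96 and 86) come from the text, while the two low-mistrust means are hypothetical placeholders chosen so that both marginal differences vanish. The function names are likewise illustrative.

```python
# Cell means for a 2 x 2 design: (mistrust level, examiner race) -> mean IQ.
# The two high-mistrust means are the values reported in the text.
# The two low-mistrust means are HYPOTHETICAL placeholders, chosen only so
# that both main effects vanish; they are not reported in the text.
cell_means = {
    ("high", "african_american"): 96.0,
    ("high", "white"): 86.0,
    ("low", "african_american"): 86.0,  # hypothetical
    ("low", "white"): 96.0,             # hypothetical
}


def marginal_mean(cells, factor_index, level):
    """Average the cell means that share one level of one factor."""
    values = [m for key, m in cells.items() if key[factor_index] == level]
    return sum(values) / len(values)


def main_effect(cells, factor_index, level_a, level_b):
    """Difference between the marginal means of two levels of one factor."""
    return marginal_mean(cells, factor_index, level_a) - marginal_mean(
        cells, factor_index, level_b
    )


def interaction_contrast(cells):
    """Examiner difference for high mistrust minus examiner difference for low mistrust."""
    high = cells[("high", "african_american")] - cells[("high", "white")]
    low = cells[("low", "african_american")] - cells[("low", "white")]
    return high - low


mistrust_effect = main_effect(cell_means, 0, "high", "low")
examiner_effect = main_effect(cell_means, 1, "african_american", "white")
interaction = interaction_contrast(cell_means)
print(mistrust_effect, examiner_effect, interaction)
```

With these placeholder values both marginal differences are zero while the interaction contrast is large, which shows how examiner race can matter a great deal for one subgroup even though neither factor has an overall effect on its own.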

Further illustrating cultural influences, Steele 
(1997) has proposed a theory that societal stereotypes 
about groups influence the immediate intellectual 
performance and also the long-term identity devel- 
opment of individual group members. He has applied 
this theory both to women—when stereotypes affect 
their achievement in math and sciences—and to 
African Americans—when stereotypes apparently 
depress their performance on standardized tests. 
Here we discuss his research on stereotype threat 
with African American college students (Steele & 
Aronson, 1995). 

FIGURE 15.4 Mean IQ Scores of African American
Students as a Function of Race of Examiner and Cultural 
Mistrust 

Source: Based on data in Terrell, F., Terrell, S., & Taylor, J. 
(1981). Effects of race of examiner and cultural mistrust on the 
WAIS performance of Black students. Journal of Consulting 
and Clinical Psychology, 49, 750-751. 


The idea of stereotype threat is essentially a 
sophisticated version of a self-fulfilling prophecy. 
The researchers define stereotype threat as the 
threat of confirming, as self-characteristic, a neg- 
ative stereotype about one’s group. For example, 
based upon published data and media coverage 
about race and IQ scores, African Americans are 
stereotyped as possessing less intellectual ability 
than others. As a consequence, whenever they en- 
counter tests of intelligence or academic achieve- 
ment, individuals from this group may perceive a 
risk that they will confirm the stereotype. In the 
short run, stereotype threat is hypothesized to 
depress test performance through heightened anx- 
iety and other mechanisms. In the long run, it may 
have the further impact of pressuring African 
American students to “protectively disidentify” 
with achievement in school and related intellec- 
tual domains. 

Steele and Aronson (1995) conducted a series 
of four studies to evaluate the hypothesis of stereo- 
type threat. All the investigations supported the 
hypothesis. We focus here upon the first study, 
in which African American and white college stu- 
dents were given a 30-minute test composed of 
challenging items from the verbal Graduate Record 
Examination. Students from both racial groups 
were randomly assigned to one of three test con- 
ditions: stereotype-threat, in which the test was 
described as diagnostic of individual verbal ability; 
control, in which the test was described as a re- 
search tool only; and control-challenge, in which 
the test was described as a research tool only but 
participants were exhorted to “take this challenge 
seriously.” Scores on the verbal test were adjusted 
(covariate analysis) on the basis of prior achieve- 
ment scores so as to eliminate the effects of preex- 
isting differences between groups. 

Race differences were small and nonsignificant 
in the control and control-challenge conditions, 
whereas African Americans scored much lower 
than whites in the stereotype-threat condition (Fig- 
ure 15.5). In other studies, Steele and Aronson 
(1995) investigated the mechanism of mediation by 
which stereotype threat caused African Americans 
to score lower on standardized tests. The details are
beyond the scope of this text, but the overall conclusion
is not:


Our best assessment is that stereotype threat caused
an inefficiency of processing much like that caused
by other evaluative pressures. Stereotype-threatened
participants spent more time doing fewer
items more inaccurately—probably as a result of
alternating their attention between trying to answer
the items and trying to assess the self-significance
of their frustration. (Steele & Aronson, 1995, p. 809)


FIGURE 15.5 Average Verbal Items Correct for Whites
and African Americans under Three Conditions

Source: Based on data in Steele, C. M., & Aronson, J. (1995).
Stereotype threat and the intellectual test performance of African
Americans. Journal of Personality and Social Psychology, 69,
797-811.


In sum, the authors propose a social-psychological 
perspective on the meaning of lower test scores in 
African Americans and perhaps other stereotype- 
threatened groups as well. Their viewpoint em- 
phasizes that test results do not reside within 
individuals. Test scores occur within a complex so- 
cial-psychological field that is potentially influ- 
enced by national history, predicaments of race, 
and many other subtle factors. 


Assessment of Cultural 
and Linguistic Minorities 


Increasingly in the last 50 years, the field of psychology
has recognized that specialized practices
may be needed to accomplish equitable testing with
cultural and linguistic minorities. Sensitivity to the 
unique issues encountered in minority assessment 
first emerged in the 1960s when the Society for the 
Psychological Study of Social Issues published 
guidelines for testing minority children. More re- 
cently, the profession of psychology has reiterated 
its concern for validity in the assessment of cultural 
and linguistic minorities by publishing guidelines for 
providers of psychological services to minority pop- 
ulations (APA, 1993). These broad guidelines pro- 
vide aspirational goals (e.g., examiners are expected 
to recognize cultural diversity), but they furnish lit- 
tle in the way of specific advice for the practitioner 
of assessment. A flurry of influential books and ar- 
ticles has ensued, each offering thoughtful commen- 
tary on multicultural assessment (e.g., Lam, 1993; 
Suzuki, Meller, & Ponterotto, 1996; Rogers, 1998). 
Certain cautionary themes can be identified in this 
literature, as discussed in the following paragraphs. 

An overriding consideration is that linguistic 
barriers may inhibit test performance of minority 
individuals. Valdes and Figueroa (1994) express the 
problem as follows: 


When a bilingual individual confronts a monolingual 
test, developed by monolingual individuals, and 
standardized and normed on a monolingual popula- 
tion, both the test taker and the test are asked to do 
something that they cannot. The bilingual test taker 
cannot perform like a monolingual. The monolingual
test can’t “measure” in the other language.
(p. 172)


What this means from a practical standpoint is that 
even when a bilingual minority child is fluent in 
English, an English-language test might still un- 
derestimate his or her true level of ability. The sen- 
sitive examiner will acknowledge this possibility 
when interpreting test results. 

It might appear that a native language inter- 
preter could be used to facilitate testing of exami- 
nees whose first language is not English. In general, 
testing specialists advise against this practice be- 
cause interpreters may substitute words, speak in a 
different dialect, or engage in subtle prompting 
that influences the examinees’ responses (Rogers, 
1998). A well-trained bilingual psychologist would
be preferable, but even this practice is considered 
problematic by some (Figueroa, 1990). 

The preferred option is to use tests translated 
into the examinee’s native language and normed on 
relevant subpopulations. Unfortunately, there are 
relatively few instruments available for this pur- 
pose. Spanish versions of prominent tests are the 
exceptions to this trend. Examples include the Escala
de Inteligencia Wechsler Para Niños-Revisada
de Puerto Rico (EIWN-R PR), a Spanish version of 
the WISC-R, and TEMAS, a Spanish version of the 
Thematic Apperception Test composed of new and 
culturally relevant stimulus cards. The EIWN-R PR 
is a 1993 Spanish adaptation of the WISC-R. This 
test was normed on 2,200 Puerto Rican children be- 
tween the ages of 6 years and 16 years and 11 
months. The EIWN-R PR retains the original structure
and content of the WISC-R while providing directions
in Spanish for administration and scoring
of all 12 subtests. TEMAS was discussed briefly in 
Topic 13B, Projective Techniques. A recent exam- 
ple of a test developed exclusively for the Latino 
population is the Neuropsychological Screening 
Battery for Hispanics created by Ponton, Gonzalez, 
Hernandez, Herrera, and Higareda (2000). 

In addition to possible linguistic barriers, mi- 
nority examinees may exhibit a lack of sophistica- 
tion about test taking that further compounds their 
disadvantage. Writing from a Hispanic back- 
ground, Padilla and Medina (1996) make the fol- 
lowing observations: 


It is quite probable that minority students are less 
familiar with standardized achievement testing and 
thus less testwise than majority students, most of 
whom have been exposed to standardized testing 
over an extended time. Another consideration is 
that more educated parents who are more testwise 
themselves engage in more coaching with their 
children to instruct them on strategies known to be 
useful on multiple-choice tests and in the impor- 
tance of balancing speed and accuracy in objective 
type tests. This type of coaching and practice is 
generally not found in the homes of lower socio- 
economic status children because their parents 
may not be knowledgeable themselves of good 
test-taking practices. (p. 14) 


There is no immediate redress to this kind of prob- 
lem other than to acknowledge that minority test 
scores may reflect a lack of test sophistication. 

The likelihood that linguistic barriers and lack 
of test sophistication will influence test results of 
minorities is a strong argument in favor of using 
a careful multidisciplinary assessment approach. 
When assessment embodies the multiple perspec- 
tives of several disciplines (e.g., psychology, 
speech, reading specialists), it is less likely that 
erroneous assessment results from any single dis- 
cipline will prove damaging. This is especially true 
in school-based assessment of culturally and lin- 
guistically diverse children. Rogers (1998) recom- 
mends that assessment of minority children 
proceed along these lines: 


1. Multidisciplinary assessments involving infor- 
mation gathered from a variety of sources and 
methods 

2. Assessments conducted in the child’s native lan- 
guage as well as English 

3. Assessments that protect children from selection 
and administration practices that are racially and 
culturally discriminatory 

4. Clearly specified procedures for assessing lin- 
guistically diverse children 

5. Informed parental consent and notification of 
rights to due process (p. 357). 


These points remind us once again that the practice 
of assessment occurs within a value-laden social 
context. The competent examiner displays a sensi- 
tivity to cultural and linguistic diversity in the ap- 
plication and interpretation of psychological tests. 


REPRISE: RESPONSIBLE TEST USE


We return now to the real-life quandaries of testing 
mentioned at the beginning of the topic. The reader 
will recall that the first quandary had to do with 
whether a consulting psychologist responsibly 
could refuse to provide feedback to police officer 
candidates referred for preemployment screening. 
Surprisingly, the answer to this query is “Yes.” 
Under normal circumstances, a practitioner must 
explain assessment results to the client. But there
are exceptions, as explained by Principle 9.10 of
the APA Ethical Code: 


Psychologists take reasonable steps to ensure that 
explanations of results are given to the individual or 
designated representative unless the nature of the 
relationship precludes provision of an explanation 
of results (such as in some organizational consult- 
ing, preemployment or security screenings, and 
forensic evaluations), and this fact has been clearly 
explained to the person being assessed in advance. 


The second quandary concerned a counselor 
who continued to use the MMPI even though the 
MMPI-2 has been available for several years. Is the 
counselor’s refusal to use the MMPI-2 a breach of 
professional standards? The answer to this query is 
probably “Yes.” The MMPI-2 is well validated and 
constitutes a significant improvement upon the 
MMPI. As mentioned previously, the MMPI-2 is 
now the standard of care in MMPI-based assess- 
ment of psychopathology. The counselor who con- 
tinued to rely upon the original MMPI could be 
liable for malpractice suits, especially if his test in- 
terpretations resulted in misleading interpretive 
statements or a false diagnosis. 

The third predicament involved the use of a 
neighborhood friend as translator in the adminis- 
tration of the WISC-III to a nine-year-old boy 
whose first language was Spanish. This is usually
a mistake as it sacrifices strict control of the testing
material. The examiner was not bilingual and there- 
fore he would have no way of knowing whether the 
translator was remaining faithful to the original text 
or was possibly supplying additional cues. In an 
ideal world, the proper procedure would be to en- 
list a Spanish-speaking examiner who would use a 
test formally translated and also standardized with 
Hispanic examinees. For example, the Escala de Inteligencia
Wechsler Para Niños-Revisada de Puerto
Rico (EIWN-R PR), discussed previously, would 
be a good choice. 

The final quandary concerned the client who informed
a psychologist that her recently deceased
brother was most likely a pedophile. Is the psy- 
chologist obligated to report this case to law en- 
forcement? The answer to this query is probably 
“Yes,” but it may depend upon the jurisdiction of 
the psychologist and the wording of the relevant 
statutes. In fact, the psychologist did report the case 
to authorities, with unexpected consequences. Po- 
lice obtained a search warrant, went to the home of 
the client’s mother (where the brother had lived), 
and ransacked the brother’s bedroom. The mother 
was traumatized by the unexpected visit from the 
police and blamed the fiasco on her daughter. A bit- 
ter estrangement followed, and the client then sued 
the psychologist for violation of confidentiality! 


SUMMARY 


1. As is true of all professional activities of 
psychologists, testing is guided by ethical and pro- 
fessional standards. Responsible test use is defined 
by written guidelines published by professional as- 
sociations such as the American Psychological As- 
sociation and other groups. 


2. Test publishers also follow professional 
guidelines, including the expectation that they will 
release tests of high quality, market their products
in a responsible manner, and restrict distribution of
tests only to persons with proper qualifications. 


3. Although there are exceptions, testing is
generally guided by one overriding question: What
is in the best interests of the client? The functional
implication of this guideline is that assessment
should serve a constructive purpose for the indi- 
vidual examinee. 


4. Psychologists have a primary obligation to 
safeguard the confidentiality of information, in- 
cluding test results, that they obtain from clients in 
the course of consultations. Exceptions include 
those unusual circumstances in which the with- 
holding of information would present a clear dan- 
ger to the client or other persons. 


5. Psychologists have a duty to warn that stems 
from the 1976 decision in the Tarasoff case. Clini- 
cians must communicate any serious threat to a po- 
tential victim, law enforcement agencies, or both. 


6. The ultimate responsibility for the proper
application of tests always rests with the test user. 
From a practical standpoint, this means that the test 
user must be well trained in assessment and mea- 
surement theory. 

7. The professional standard on informed 
consent provides that test takers must be informed 
of the reasons for testing, the types of tests to be 
used, the possible consequences of the testing, and 
what testing information will be released and to 
whom. 


8. The prevailing standard of care is one that 
is usual, customary, or reasonable. Meeting the 
standard of care means that psychologists should 
refrain from using outdated tests, especially when 
a new edition is available. 


9. Other guidelines for the responsible use of
tests include thoughtful and effective report writ- 
ing, as well as a reflective and sensitive delivery of 
feedback to examinees in which their misconcep- 
tions are carefully dispelled. 


10. Another expectation is that assessment will 
be guided by a knowledge of, and respect for, individual
differences. For example, practitioners
should know about the effects of age, gender, race, 
ethnicity, and other background variables on test 
results. 


11. Cultural factors that may influence test re- 
sults include the qualitative manner of approach- 
ing a test, racial trust/mistrust, and stereotype 
threat, which is the threat of confirming, as self- 
characteristic, a negative stereotype about one’s 
group. 

12. Linguistic barriers also may inhibit test 
performance of minority individuals. Bilingual per- 
sons, and individuals whose first language is not 
English, may encounter subtle problems on tests 
developed for use in the dominant culture. 


13. A lack of sophistication about the nature of 
tests is another factor encountered by some minor- 
ity individuals. Linguistic barriers and a lack of 
sophistication about testing are strong arguments 
in favor of using a multidisciplinary approach to 
assessment (e.g., psychology, speech, reading 
specialists). 


Major Landmarks in
the History of
Psychological Testing


2200 B.C.  Chinese begin civil service examinations.

1838  Jean Esquirol distinguishes between
mental illness and mental retardation.

1862  Wilhelm Wundt uses a calibrated pendulum
to measure the “speed of thought.”

1866  O. Edouard Seguin writes the first major
textbook on the assessment and treatment
of mental retardation.

1879  Wundt founds the first experimental laboratory
in psychology in Leipzig, Germany.

1884  Francis Galton administers the first test
battery to thousands of citizens at the International
Health Exhibit.

1890  James McKeen Cattell uses the term
mental test in announcing the agenda for
his Galtonian test battery.

1896  Emil Kraepelin provides the first comprehensive
classification of mental disorders.

1901  Clark Wissler discovers that Cattellian
“brass instruments” tests have no correlation
with college grades.

1904  Charles Spearman proposes that intelligence
consists of a single general factor
g and numerous specific factors s1, s2,
s3, and so forth.

Karl Pearson formulates the theory of 
correlation. 

Alfred Binet and Theodore Simon invent 
the first modern intelligence test. 

Henry H. Goddard translates the Binet- 
Simon scales from French into English. 
Stern introduces the IQ, or intelligence
quotient: the mental age divided by
chronological age.
Lewis Terman revises the Binet-Simon 
scales, publishes the Stanford-Binet; re- 
visions appear in 1937, 1960, and 1986. 
Robert Yerkes spearheads the develop- 
ment of the Army Alpha and Beta exam- 
inations used for testing WWI recruits. 
Robert Woodworth develops the Personal 
Data Sheet, the first personality test. 
The Rorschach inkblot test is published. 
Psychological Corporation—the first 
major test publisher—is founded by Cat- 
tell, Thorndike, and Woodworth. 
Florence Goodenough publishes the 
Draw-A-Man Test. 

1926


1927 


1935 
1936 
1936 


1938 


1938 
1938 
1938


1939 


1939 


1939 
1942 


1948 


1949 


APPENDIX A 


The first Scholastic Aptitude Test is pub- 
lished by the College Entrance Exami- 
nation Board. 

The first edition of the Strong Vocational 
Interest Blank is published. 

The Thematic Apperception Test is re- 
leased by Morgan and Murray at Har- 
vard University. 

Lindquist and others publish the precur- 
sor to the Iowa Tests of Basic Skills. 
Edgar Doll publishes the Vineland Social 
Maturity Scale for assessment of adap- 
tive behavior in those with mental retar- 
dation. 

L. L. Thurstone proposes that intelli- 
gence consists of about seven group 
factors known as primary mental 
abilities. 

Raven publishes the Raven’s Progressive 
Matrices, a nonverbal test of reasoning 
intended to measure Spearman’s g factor. 
Lauretta Bender publishes the Bender 
Visual Motor Gestalt Test, a design- 
copying test of visual-motor integration. 
Oscar Buros publishes the first Mental 
Measurements Yearbook. 

Arnold Gesell releases his scale of infant 
development. 

The Wechsler-Bellevue Intelligence 
Scale is published; revisions are pub- 
lished in 1955 (WAIS), 1981 (WAIS-R), 
and 1997 (WAIS-III).

Taylor-Russell tables published for de- 
termining the expected proportion of 
successful applicants with a test. 

The Kuder Preference Record, a forced- 
choice interest inventory, is published. 
The Minnesota Multiphasic Personality 
Inventory (MMPI) is published. 

Office of Strategic Services (OSS) uses 
situational techniques for selection of 
officers. 

The Wechsler Intelligence Scale for 
Children is published; revisions are 
published in 1974 (WISC-R) and 1991 
(WISC-III).


1950 


1951 


1968 


1969 


1969 


1971 


The Rotter Incomplete Sentences Blank 
is published. 

Lee Cronbach introduces coefficient 
alpha as an index of reliability (internal 
consistency) for tests and scales. 
American Psychiatric Association pub- 
lishes the Diagnostic and Statistical 
Manual (DSM-I). 

Stephenson develops the Q-technique 
for studying the self-concept and other 
variables. 

Paul Meehl publishes Clinical vs. Statis- 
tical Prediction. 

The Halstead-Reitan Test Battery begins 
to emerge as the premier test battery in
neuropsychology. 

C. E. Osgood describes the semantic 
differential. 

Lawrence Kohlberg publishes the first 
version of his Moral Judgment Scale; re- 
search with it expands until the mid- 
1980s. 

Campbell and Fiske publish a test vali- 
dation approach known as the multitrait- 
multimethod matrix. 

Raymond Cattell proposes the theory of 
fluid and crystallized intelligences. 

In Hobson v. Hansen the court rules 
against the use of group ability tests to 
“track” students on the grounds that 
such tests discriminate against minority 
children. 

American Psychiatric Association publishes
DSM-II.

Nancy Bayley publishes the Bayley 
Scales of Infant Development (BSID). 
The revised version (BSID-2) is pub- 
lished in 1993. 

Arthur Jensen proposes the genetic 
hypothesis of African American versus 
white IQ differences in the Harvard Ed- 
ucational Review. 

In Griggs v. Duke Power the Supreme 
Court rules that employment test results 
must have a demonstrable link to job 
performance. 


1980


George Vaillant popularizes a hierarchy
of 18 ego adaptive mechanisms and de- 
scribes a methodology for their assess- 
ment. 

Court decision requires that tests used 
for personnel selection must be job rele- 
vant (Griggs v. Duke Power). 

The Model Penal Code rule for legal in- 
sanity is published and widely adopted 
in the United States. 

Rudolf Moos begins publication of the 
Social Climate Scales to assess different 
environments. 

Friedman and Rosenman popularize the 
Type A coronary-prone behavior pattern; 
their assessment is interview-based. 

The U.S. Congress passes Public Law 
94-142, the Education for All Handi- 
capped Children Act. 

Jane Mercer publishes SOMPA (System 
of Multicultural Pluralistic Assessment), 
a test battery designed to reduce cultural 
discrimination. 

In the Uniform Guidelines on Employee 
Selection adverse impact is defined by 
the four-fifths rule; also guidelines for 
employee selection studies are pub- 
lished. 

In Larry P. v. Riles the court rules that 
standardized IQ tests are culturally 
biased against low-functioning black 
children. 

In Parents in Action on Special Education
v. Hannon the court rules that standardized
IQ tests are not racially or culturally
biased.

1992
2003

The American Psychological Associa- 
tion and other groups jointly publish the 
influential Standards for Educational 
and Psychological Testing. 

Sparrow and others publish the Vineland 
Adaptive Behavior Scales, a revision of 
the pathbreaking 1936 Vineland Social 
Maturity Scale. 

American Psychiatric Association publishes
DSM-III-R.

The Lake Wobegon Effect is noted: Vir- 
tually all states of the union claim that 
their achievement levels are above aver- 
age. 

The Minnesota Multiphasic Personality 
Inventory-2 is published.

American Psychological Association 
publishes a revised Ethical Principles of 
Psychologists and Code of Conduct 
(American Psychologist, December 
1992).

American Psychiatric Association pub- 
lishes DSM-IV. 

Herrnstein and Murray revive the race 
and IQ heritability debate in The Bell 
Curve. 

APA and other groups publish revised 
Standards for Educational and Psycho- 
logical Testing. 

New revision of APA Ethical Principles 
of Psychologists and Code of Conduct 
goes into effect. 
Major Tests
and Their Publishers 


Individual Tests of Intelligence and Adaptive Behavior 


Adaptive Behavior Inventory for Children                      Psychological Corporation
AAMR Adaptive Behavior Scale                                  Pro-Ed
AAMR Adaptive Behavior Scale-School Edition                   Pro-Ed
Balthazar Scales of Adaptive Behavior                         Consulting Psychologists Press
Bayley Scales of Infant Development-II                        Psychological Corporation
The Blind Learning Aptitude Test                              University of Illinois Press
Bruininks-Oseretsky Test of Motor Proficiency                 American Guidance Service
Columbia Mental Maturity Scale                                Psychological Corporation
Denver-2                                                      Denver Developmental Materials
Detroit Tests of Learning Aptitude-4                          Pro-Ed
Developmental Indicators for the Assessment of Learning-III   American Guidance Service
Differential Ability Scales                                   Psychological Corporation
Draw-A-Person: Quantitative Scoring System                    Psychological Corporation
Goodenough-Harris Drawing Test                                Psychological Corporation
Hiskey-Nebraska Test of Learning Aptitude                     Marshall S. Hiskey
Independent Living Behavior Checklist                         West Virginia Rehabilitation Center
Kaufman Assessment Battery for Children                       American Guidance Service
Kaufman Brief Intelligence Test                               American Guidance Service
Kaufman Adolescent and Adult Intelligence Test                American Guidance Service
Leiter International Performance Scale-Revised                Stoelting Company
McCarthy Scales of Children’s Abilities                       Psychological Corporation
Miller Assessment for Preschoolers                            Psychological Corporation


Ordinal Scales of Psychological Development                   Uzgiris & Hunt (1989)
Peabody Picture Vocabulary Test-III                           American Guidance Service
Scales of Independent Behavior-R                              DLM Teaching Resources
Stanford-Binet: Fifth Edition                                 Riverside Publishing Co.
Test of Nonverbal Intelligence-3                              Pro-Ed
T. M. R. School Competency Scales                             Consulting Psychologists Press
Vineland Adaptive Behavior Scales                             American Guidance Service
Wechsler Adult Intelligence Scale-III                         Psychological Corporation
Wechsler Intelligence Scale for Children-III                  Psychological Corporation
Wechsler Preschool and Primary Scale of Intelligence-Revised  Psychological Corporation


Group Intelligence Tests

Cognitive Abilities Test                                      Riverside Publishing Co.
Culture Fair Intelligence Test                                Institute for Personality and Ability Testing
Henmon-Nelson Tests of Mental Ability                         Riverside Publishing Co.
Kuhlmann-Anderson Tests of Mental Ability                     Scholastic Testing Service
Miller Analogies Test                                         Psychological Corporation
Multidimensional Aptitude Battery                             Sigma Assessment Systems
Otis-Lennon School Ability Test                               Psychological Corporation
Raven’s Progressive Matrices                                  Psychological Corporation [distributor]
School and College Ability Tests-Series III                   Educational Testing Service
Shipley Institute of Living Scale                             Western Psychological Services
Wonderlic Personnel Evaluation                                Wonderlic Personnel Test


Aptitude Tests and Batteries

American College Testing Assessment Program                   American College Testing Program
Armed Services Vocational Aptitude Battery                    U.S. Military Entrance Processing Command
Differential Aptitude Tests                                   Psychological Corporation
General Aptitude Test Battery                                 U.S. Employment Service
Graduate Record Examinations                                  Educational Testing Service
Scholastic Assessment Tests                                   Educational Testing Service


Group Achievement Tests

California Achievement Tests                                  CTB/Macmillan/McGraw-Hill
Comprehensive Tests of Basic Skills                           CTB/Macmillan/McGraw-Hill
Iowa Tests of Basic Skills                                    Riverside Publishing Co.
Iowa Tests of Educational Development                         Riverside Publishing Co.
Metropolitan Achievement Tests                                Psychological Corporation
Sequential Tests of Educational Progress, Series III          Educational Testing Service
SRA Achievement Series                                        SRA/London House
Stanford Achievement Series                                   SRA/London House
Stanford Test of Academic Skills                              Psychological Corporation
Tests of Achievement and Proficiency                          Riverside Publishing Co.
Tests of General Educational Development (GED)                GED Testing Service


Individual Achievement Tests

Kaufman Test of Educational Achievement                       American Guidance Service
Mini-Battery of Achievement                                   DLM Teaching Resources
Peabody Individual Achievement Test-Revised                   American Guidance Service
Wechsler Individual Achievement Test-2                        Psychological Corporation
Wide Range Achievement Test-III                               Jastak Associates, Inc.
Woodcock-Johnson Psycho-Educational Battery-Revised           Riverside Publishing Co.


Psychomotor and Dexterity Tests

Crawford Small Parts Dexterity Test                           Psychological Corporation
Hand-Tool Dexterity Test                                      Psychological Corporation
Purdue Pegboard                                               SRA/London House
Stromberg Dexterity Test                                      Psychological Corporation


Clerical Tests

Clerical Abilities Battery                                    Psychological Corporation
General Clerical Test                                         Psychological Corporation
Minnesota Clerical Test                                       Psychological Corporation
SRA Clerical Aptitudes                                        SRA/London House


Mechanical Aptitude Tests

Bennett Mechanical Comprehension Test                         Psychological Corporation
Minnesota Spatial Relations Test                              American Guidance Service
Revised Minnesota Paper Form Board Test                       Psychological Corporation
SRA Mechanical Aptitudes                                      SRA/London House


Interest Inventories

Campbell Interest and Skill Survey                            National Computer Systems
Jackson Vocational Interest Survey                            Sigma Assessment Systems
Kuder General Interest Survey                                 SRA/London House


Kuder Occupational Interest Survey, Revised                   SRA/London House
Kuder Preference Record                                       SRA/London House
Self-Directed Search                                          Psychological Assessment Resources
Strong Interest Inventory                                     Consulting Psychologists Press


Neuropsychological Tests

Bender Visual Motor Gestalt Test                              Western Psychological Services
Benton Revised Visual Retention Test                          Psychological Corporation
Finger Localization Test                                      Oxford University Press
Halstead-Reitan Neuropsychological Test Battery               Reitan Neuropsychology Laboratories
Luria-Nebraska Neuropsychological Battery                     Western Psychological Services
Porteus Maze Test                                             Psychological Corporation
Serial Digit Learning Test                                    Oxford University Press
Symbol Digit Modalities Test                                  Western Psychological Services
Three-Dimensional Block Construction Test                     Oxford University Press
Wechsler Memory Scale-Revised                                 Psychological Corporation
Wisconsin Card Sorting Test                                   Psychological Assessment Resources


Projective Personality Tests

Children’s Apperception Test                                  C. P. S., Inc.
Draw-A-Person Test                                            Charles C. Thomas
Holtzman Inkblot Test                                         Psychological Corporation
Rorschach                                                     Hogrefe & Huber Publishers
Rosenzweig Picture-Frustration Study (P-F Study)              Saul Rosenzweig
Rotter Incomplete Sentences Blank                             Psychological Corporation
Senior Apperception Technique                                 C. P. S., Inc.
Thematic Apperception Test (TAT)                              Harvard University Press
Washington University Sentence Completion Test                Jossey-Bass


Self-Report Personality Inventories

Beck Depression Inventory-Revised                             Psychological Corporation
California Psychological Inventory                            Consulting Psychologists Press
Comrey Personality Scales                                     EdITS, Educational and Industrial Testing Service
Edwards Personal Preference Schedule                          Psychological Corporation
Eysenck Personality Questionnaire                             EdITS, Educational and Industrial Testing Service
Jenkins Activity Survey                                       Psychological Corporation
Millon Clinical Multiaxial Inventory-3                        National Computer Systems
Minnesota Multiphasic Personality Inventory-2                 National Computer Systems
Myers-Briggs Type Indicator                                   Consulting Psychologists Press
NEO Five Factor Inventory                                     Sigma Assessment Systems
NEO Personality Inventory-Revised                             Sigma Assessment Systems
Personality Inventory for Children-2                          Western Psychological Services
Personality Research Form                                     Sigma Assessment Systems
Sixteen Personality Factor Questionnaire (16PF)               Institute for Personality and Ability Testing
State-Trait Anxiety Inventory                                 Mind Garden
Survey of Work Styles                                         Sigma Assessment Systems


Forensic Tests

Custody Quotient™                                             Wilmington Institute
Parent-Child Relationship Inventory                           Western Psychological Services
Rogers Criminal Responsibility Assessment Scales              Psychological Assessment Resources
Standard and Standardized-Score Equivalents of
Percentile Ranks in a Normal Distribution


This table lists the equivalence between percentile
ranks and four other types of scores: z scores
(mean of 0, SD of 1.00), deviation IQs (mean of 100,
SD of 15), T scores (mean of 50, SD of 10), and
GRE-like scores (mean of 500, SD of 100). The
application of the table assumes that scores on the
test or variable are normally distributed.

We illustrate how this appendix can be used
with two examples. Suppose that we desire to know
the WAIS-R IQ that is equivalent to a percentile
rank of 97. Reading across the row that begins with
PR 97, we discover that the equivalent IQ is 128.
Suppose that we desire to know the percentile rank
that is equivalent to a GRE score of 675. In the far
right column, we locate a score of 675 and read
across to the left-hand column to discover that the
equivalent percentile rank is 96.
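
The same lookups can be approximated without the printed table. The sketch below is an illustration only, not part of the original appendix: it assumes a perfectly normal distribution and uses Python's standard-library `statistics.NormalDist`. The function names and the `SCALES` dictionary are hypothetical, and Python's rounding may differ from the printed table by one unit in borderline cases.

```python
from statistics import NormalDist

# Mean and SD of each score scale, as given in the appendix text.
SCALES = {
    "deviation_iq": (100.0, 15.0),
    "t_score": (50.0, 10.0),
    "gre_like": (500.0, 100.0),
}


def scores_for_percentile(pr):
    """Return equivalent scores for a percentile rank strictly between 0 and 100."""
    z = NormalDist().inv_cdf(pr / 100.0)  # z score at that cumulative proportion
    result = {"z": round(z, 2)}
    for name, (mean, sd) in SCALES.items():
        result[name] = round(mean + sd * z)  # linear transformation of z
    return result


def percentile_for_score(score, mean, sd):
    """Return the (rounded) percentile rank of a score on a normal scale."""
    return round(100.0 * NormalDist(mean, sd).cdf(score))


# The two worked examples from the text.
print(scores_for_percentile(97)["deviation_iq"])   # 128
print(percentile_for_score(675, 500, 100))         # 96
```

Both results agree with the values read from the table in the examples above.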


 PR       z    Deviation IQ    T Score    GRE-Like Score
 93    1.48        122           65            648
 92    1.41        121           64            641
 91    1.34        120           63            634
 90    1.28        119           63            628
 89    1.22        118           62            622
 88    1.18        118           62            618
 87    1.13        117           61            613
 86    1.08        116           61            608
 85    1.04        116           60            604

(continued) 
 PR       z    Deviation IQ    T Score    GRE-Like Score
 42   -0.20         97           48            480
 41   -0.23         96           48            477
 40   -0.25         96           47            475
 39   -0.28         96           47            472
 38   -0.31         95           47            469
 37   -0.33         95           47            467
 36   -0.36         95           46            464
 35   -0.39         94           46            461
 34   -0.41         94           46            459
 33   -0.44         93           46            456
 32   -0.47         93           45            453
 31   -0.49         93           45            451
 30   -0.52         92           45            448
 29   -0.55         92           44            445
 28   -0.58         91           44            442
 27   -0.61         90           44            439
 26   -0.64         90           44            436
 25   -0.67         90           43            433
 24   -0.71         89           43            429
 23   -0.74         89           43            426
 22   -0.77         88           42            423
 21   -0.80         88           42            420
 20   -0.84         87           42            416
 19   -0.88         87           41            412
 18   -0.91         86           41            409
 17   -0.95         86           40            405
 16   -0.99         85           40            401
 15   -1.04         84           40            396
 14   -1.08         84           39            392
 13   -1.13         83           39            387
 12   -1.18         82           38            382
 11   -1.22         82           38            378
 10   -1.28         81           37            372
  9   -1.34         80           37            366
  8   -1.41         79           36            359
  7   -1.48         78           35            352
  6   -1.55         77           34            345
  5   -1.64         75           34            336
  4   -1.75         74           32            325
  3   -1.88         72           31            312
  2   -2.05         69           29            295
  1   -2.33         65           27            267


Glossary


accommodation in Piaget’s theory, the adjustment of 
an unsuccessful schema so that it works. 

achievement test a test that measures the degree of 
learning, success, or accomplishment in a subject 
matter.

actuarial judgment the kind of automated judgment 
in which an empirically derived formula is used to 
diagnose or predict behavior. 

adverse impact in hiring, adverse impact is said to 
exist if one group has a selection rate less than four- 
fifths of the rate of the group with the highest selec- 
tion rate (Uniform Guidelines on Employee Selection, 
1978). 
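
As a rough illustration of the four-fifths rule just defined, the sketch below compares each group's selection rate with 80 percent of the highest group's rate. The function name, the group labels, and the applicant counts are all hypothetical and chosen only for the example; an actual adverse impact analysis involves further considerations that this sketch ignores.

```python
def adverse_impact_flags(selection_counts):
    """Flag groups whose selection rate is below four-fifths of the highest rate.

    selection_counts maps a group label to (number selected, number of applicants).
    """
    rates = {group: selected / applicants
             for group, (selected, applicants) in selection_counts.items()}
    highest = max(rates.values())
    # Adverse impact is indicated when a group's rate is below 0.8 * highest rate.
    return {group: rate < 0.8 * highest for group, rate in rates.items()}


# Hypothetical data: group_a is hired at 48/80 = .60, group_b at 12/30 = .40.
counts = {"group_a": (48, 80), "group_b": (12, 30)}
print(adverse_impact_flags(counts))
```

Here the rate for group_b (.40) is two-thirds of the rate for group_a (.60), which is below four-fifths, so group_b is flagged.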

age norm a type of standardization that depicts the 
level of test performance for each separate age group 
in the normative sample. 

alternate-forms reliability a form of reliability in 
which alternate forms of the same test are given to a 
group of heterogeneous and representative subjects; 
scores for the two forms are then correlated. 

Alzheimer’s disease a degenerative neurological dis- 
order; in the early stages, the most prominent symp- 
tom is memory loss. 

Americans with Disabilities Act an act passed by 
Congress in 1990 that forbids discrimination against 
qualified individuals with disabilities. 

aphasia any deviation in language performance caused 
by brain damage. 

apraxia a variety of dysfunctions characterized by a
breakdown in the direction or execution of complex 
motor acts. 

aptitude test a test that measures one or more clearly 
defined and relatively homogeneous segments of 
ability. 

architectural system likened to “hardware” in the in- 
formation-processing approach to intelligence, the 
architectural system refers to biologically based 
properties (e.g., memory span, speed of encoding) 
necessary for information processing. 

assessment appraising or estimating the level or
magnitude of some attribute of a person; testing is one
small part of assessment which also incorporates
observations, interviews, rating scales, and checklists.

assessment center an approach to assessment of man- 
agerial talent which consists of multiple simulation 
techniques, including group presentations, problem- 
solving exercises, group discussion exercises, inter- 
views, and in-basket techniques. 

assimilation in Piaget’s theory, the application of a 
schema to an object, person, or event. 

attention-deficit/hyperactivity disorder a behavioral 
syndrome characterized by fidgeting, distractibility, 
impulsivity, attentional deficits, poor social skills, 
and not considering consequences.

attitude learned cognitive, affective, and behavioral 
predispositions to respond positively or negatively to 
certain objects, situations, institutions, concepts, or 
persons. 

basal ganglia a collection of nuclei in the forebrain 
that make connections with the cerebral cortex above 
and the thalamus below; the basal ganglia participate 
in the control of movement. 

basal level for tests in which subtest items are ranked 
from easiest to hardest, the level below which the ex- 
aminee would almost certainly answer all questions 
correctly. 

base rate in decision theory, the proportion of suc- 
cessful applicants who would be selected using cur- 
rent methods, without benefit of the new test. 


behavior observation scale a variation upon the
BARS technique which uses a continuum from
“almost never” to “almost always” to measure how
often an employee performs specific tasks on each 
behavioral dimension. 

behavior sample in testing, the notion that a test is just
a sample of behaviors that permits the examiner to 
make inferences about a larger domain of relevant 
behaviors. 

behavior therapy the application of the methods and 
findings of experimental psychology to the modifi- 
cation of maladaptive behavior; also called behavior 
modification. 
behavioral assessment a variety of techniques that
concentrate on behavior itself rather than on under- 
lying traits, hypothetical causes, or presumed di- 
mensions of personality. 

behavioral avoidance test a behavioral procedure in 
which the therapist measures how long the client can 
tolerate an anxiety-inducing stimulus. 

behavioral procedure a procedure for assessing the an- 
tecedents and consequences of behavior; behavioral 
procedures include checklists, rating scales, inter- 
views, and structured observations. 

behaviorally anchored rating scale a criterion-refer- 
enced rating scale. 

bias in construct validity a type of bias demonstrated 
when a test is shown to measure different hypothet- 
ical traits (psychological constructs) for one group 
than another or to measure the same trait but with 
differing degrees of accuracy. 

bias in content validity a type of bias demonstrated 
when an item or subscale is relatively more difficult 
for members of one group than another after the gen- 
eral ability level of the two groups is held constant. 

bias in predictive validity a type of bias demonstrated 
when the inference drawn from the test score is not 
made with the smallest feasible random error or if 
there is constant error in an inference or prediction 
as a function of membership in a particular group. 

biodata objective or scoreable autobiographical data; 
recognized as a valid adjunct to personnel selection. 

C scale a variant on the stanine scale with 11 units. 

ceiling level for tests in which subtest items are ranked 
from easiest to hardest, the level above which the ex- 
aminee would almost certainly fail all remaining 
questions. 

cerebellum part of the hindbrain responsible for help- 
ing to coordinate muscle tone, posture, and skilled 
movements.

cerebral cortex the outermost layer of the brain that is 
the source of the highest levels of sensory, motor, and 
cognitive processing. 

certification testing to determine that a person has at 
least a minimum proficiency in some discipline or 
activity. 

classical theory of measurement the dominant theory 
in psychological testing; the theory assumes that an 
observed score consists of a true score plus mea- 
surement error. 

classification in testing, the process of using tests to 
assign a person to one category rather than another. 


clerical scoring error in testing, an error in test scor- 
ing related to the mechanics of scoring, such as 
adding subscores incorrectly or consulting the wrong 
conversion table. 

clinical judgment the kind of judgment in which the 
decision maker processes information in his or her 
head to diagnose or predict behavior. 

coaching in testing, the attempt to boost test scores by 
providing the examinee with extra practice on test- 
like materials, review of fundamental concepts likely 
to be covered by the test, and advice about optimal 
test-taking strategies. 

coefficient alpha an index of reliability that may be 
thought of as the mean of all possible split-half co- 
efficients, corrected by the Spearman-Brown for- 
mula. 
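
For readers who want to see a computation, the sketch below uses a common computational formula for coefficient alpha, k/(k - 1) times one minus the ratio of summed item variances to total-score variance. This formula is not stated in the glossary entry and is supplied here as an assumption; the function name and the response data are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows, each a list of k item scores."""
    k = len(item_scores[0])
    # Variance of each item across examinees.
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the examinees' total scores.
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings from five examinees on a three-item scale.
responses = [[3, 4, 3], [2, 2, 3], [4, 5, 5], [1, 2, 1], [3, 3, 4]]
print(round(coefficient_alpha(responses), 2))
```

When every item ranks the examinees identically, the value reaches 1.00; weakly related items pull it toward zero.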

cognitive behavior therapy an approach to behavior 
change that emphasizes changing the client’s belief 
structure. 

competency to stand trial the determination by the 
presiding judge that a defendant does not have a 
mental defect, illness, or condition that renders him 
or her unable to understand the proceedings or to as- 
sist in his or her defense. 

componential intelligence in Sternberg’s theory, the 
internal mental mechanisms that are responsible for 
intelligent behavior. 

computer-assisted psychological assessment CAPA 
refers to the entire range of computer applications in 
psychological assessment and includes testing, scor- 
ing, report writing, and individualized test adminis- 
tration. 

computer-based test interpretation CBTI refers to 
test interpretation and report writing by computer, 
which is a major component of computer-assisted 
psychological assessment (CAPA). 

computerized adaptive testing a family of procedures 
that allows for accurate and efficient measurement 
of ability; individualized testing continues until a 
predetermined level of measurement precision is 
reached. 

concurrent validity a type of criterion-related validity 
in which the criterion measures are obtained at ap- 
proximately the same time as the test scores. 

concussion a transitory alteration of consciousness 
from a blow to the head; may be followed by tem- 
porary amnesia, dizziness, nausea, weak pulse, and 
slow respiration, yet there is no demonstrable or- 
ganic brain damage. 


conservation in Piaget’s theory, the awareness that
physical quantities do not change in amount when
they are superficially altered in appearance.

construct a theoretical, intangible quality or trait in 
which individuals differ. 

construct validity a type of validity that refers to the 
appropriateness of test-based inferences about the un- 
derlying construct purportedly measured by the test. 

constructional dyspraxia impairment of the ability to 
deal with spatial relationships either in a two- or 
three-dimensional framework. 

consumer psychology the branch of industrial/organi- 
zational psychology that deals with the develop- 
ment, advertising, and marketing of products and 
services. 

content validity the type of validity that is determined 
by the degree to which the questions, tasks, or items 
on a test are representative of the universe of behav- 
ior the test was designed to sample. 

contextual intelligence in Sternberg’s theory, the men- 
tal activity involved in purposive adaptation to, shap- 
ing of, and selection of real-world environments 
relevant to one’s life. 

contingency management procedure an approach to 
behavior therapy in which the therapist identifies and 
alters the consequences of unwanted behaviors. 

convergent validity a type of validity that is demon- 
strated when a test correlates highly with other vari- 
ables or tests with which it shares an overlap of 
constructs. 

corpus callosum the major commissure that serves to 
integrate the functions of the two cerebral hemi- 
spheres. 

correction for guessing in group testing, the practice 
of revising a subject’s final score in light of apparent 
guessing. 

correlation coefficient a numerical index of the degree 
of linear relationship between two sets of scores; cor- 
relation coefficients can vary between -1.00 and 
+1.00. 
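
A minimal sketch of the computation, assuming the Pearson product-moment coefficient is the index meant here (the entry itself does not name a specific formula). The function name and the score lists are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five examinees on two tests.
test_a = [10, 12, 15, 18, 20]
test_b = [22, 25, 24, 30, 33]
print(round(pearson_r(test_a, test_b), 2))
```

A perfectly increasing linear relationship yields +1.00 and a perfectly decreasing one yields -1.00, the two limits named in the definition.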

correlation matrix a complete table of intercorrela- 
tions between all the variables that is the beginning 
point of factor analysis. 

cranial nerves twelve paired neural tracts that help 
govern basic sensory and motor functions such as vi- 
sion, smell, facial movement, taste, and hearing. 

creativity test a test that assesses the ability to produce 
new ideas, insights, or artistic creations that are ac- 
cepted as being of social, aesthetic, or scientific value. 
criterion contamination a source of error in test
validation when the criterion is “contaminated” by its
artificial commonality with the test, such as when test
and criterion contain nearly identical items. Also, a form
of evaluation error in which a criterion measure in- 
cludes factors that are not demonstrably part of the 
job, for example, rating appearance when it is not 
job related. 

criterion-keyed approach a test development ap- 
proach in which test items are assigned to a particu- 
lar scale if, and only if, they discriminate between a 
well-defined criterion group and a relevant control 
group. 

criterion problem the difficult problem of conceptu- 
alizing and measuring work performance constructs 
which are often complex, fuzzy, and multidimen- 
sional. 

criterion-referenced test a test in which the objective 
is to determine where the examinee stands with re- 
spect to very tightly defined educational objectives; 
no comparison is made to the performance of other 
examinees. 

criterion-related validity the type of validity that is
demonstrated when a test is shown to be effective in 
estimating an examinee’s performance on some out- 
come measure. 

critical incidents checklist a form of performance 
evaluation based upon actual episodes of desirable 
and undesirable on-the-job behavior. 

cross-sectional design a research design in which
subjects of different ages are tested at one point in 
time. 

cross-sequential design a research design that com- 
bines cross-sectional and longitudinal methods. 

cross-validation for predictive tests, the practice of 
using the original regression equation in a new sam- 
ple to determine whether the test predicts the crite- 
rion as well as it did in the original sample. 

crystallized intelligence in Cattell and Horn’s theory, 
what one has already learned through the investment 
of fluid intelligence in cultural settings (e.g., learn- 
ing algebra in school). 

culture-fair test a test designed to minimize irrelevant
influences of cultural learning and social climate and 
thereby produce a cleaner separation of natural abil- 
ity from specific learning. 

custody evaluation in divorce cases, the psychological 
evaluation of a child (or children) and both parents 
so as to offer an opinion to the court as to the best 
interests of the child (or children) in custody
arrangements.

decision theory an approach to psychological mea- 
surement that considers the costs and benefits of test- 
based decisions, for example, in personnel selection. 

defense mechanisms unconscious mental strategies 
available to the ego in dealing with the conflicting 
demands of id, superego, and external reality. 

diagnosis determining the nature and source of a
person’s abnormal behavior, and classifying the
behavior pattern within an accepted diagnostic system.

discriminant validity a type of validity that is demon- 
strated when a test does not correlate with variables 
or tests from which it should differ. 

divergent production the creation of numerous ap- 
propriate responses to a single stimulus situation. 

divergent thinking the kind of thinking that goes off 
in different directions. 

Durham rule the legal provision for the defense of in- 
sanity if the criminal act was a “product” of mental 
disease or defect; dropped in 1972 and replaced by 
the Model Penal Code. 

duty to warn stemming from the Tarasoff case, the
responsibility of clinicians to communicate any
serious threat to the potential victim, law enforcement
agencies, or both. 

dysarthria slurred, hesitant speech (not drug or alco- 
hol induced) that often signifies damage to the cere- 
bellum. 

ecological momentary assessment using wireless 
technology to measure patient experience (e.g., 
pain, fatigue, mood) in the real world at the point of 
experience. 

ego in psychoanalytic theory, the conscious self that 
mediates between the id and reality. 

equilibration in Piaget’s theory, the entire process of 
assimilation, accommodation, and equilibrium. 

executive functions brain functions that include logi- 
cal analysis, conceptualization, reasoning, planning, 
and flexibility of thinking. 

executive system likened to “software” in the infor- 
mation-processing approach to intelligence, the ex- 
ecutive system refers to environmentally learned 
components that steer problem solving and provide 
overall guidance. 

expectancy table a table that portrays the established 
relationship between test scores and expected out- 
come on a relevant task. 

experiential intelligence in Sternberg’s theory, the 
ability to deal effectively with novel tasks. 


expert rankings a scaling method that relies upon the 
judgment of experts to determine the rankings for in- 
dividual components. 

expert witness in court cases, a witness whom the 
judge deems qualified to testify about a proper sub- 
ject matter. 

extravalidity concerns the side effects and unintended 
consequences of testing. 

extraversion a sociable, outgoing, excitement-seeking 
personality disposition. 

extrinsic religious expression the use of religion for 
external goals such as security, status, and friendship. 

face validity for tests, the appearance of validity to test 
users, examiners, and especially the examinees; not 
a technical form of validity, but important for the so- 
cial acceptability of a test. 

factor an underlying construct or variable that helps 
explain the correlations between several tests or 
measures. 

factor analysis a family of statistical procedures that 
researchers use to summarize relationships among 
variables that are correlated in highly complex ways; 
the goal of factor analysis is to derive a parsimonious 
set of derived factors. 

factor loading in factor analysis, the correlation be- 
tween an individual test and a single factor. 

factor matrix a table of correlations between variables 
and factors; the correlations are called factor loadings. 

false negatives in decision theory, subjects who are
incorrectly predicted to fail on the criterion.

false positives in decision theory, subjects who are
incorrectly predicted to succeed on the criterion.

fear survey schedule a behavioral assessment device 
which requires respondents to indicate the presence 
and intensity of their fears in relation to various stim- 
uli, typically on a 5- or 7-point Likert scale. 

fetal alcohol effect a subtle version of fetal alcohol 
syndrome in which physical abnormalities are not 
observed, but behavioral problems such as atten- 
tional difficulties are noted. 

fetal alcohol syndrome a cluster of physical and be- 
havioral abnormalities, including mental retardation, 
caused by the mother’s drinking of alcohol during 
pregnancy. 

fluid intelligence in Cattell and Horn’s theory, a 
largely nonverbal and relatively culture-reduced 
form of mental efficiency. 

forced-choice method in personality test develop- 
ment, an item-writing method in which the alterna- 
tives are matched for social desirability. 


forced-choice scale a performance evaluation scale de- 
signed to eliminate bias and subjectivity in supervi- 
sor ratings by forcing a choice between options that 
are equal in social desirability. 

forebrain the large, outermost portion of the brain con- 
sisting of the cerebral cortex and underlying struc- 
tures such as the corpus callosum, basal ganglia, 
limbic lobe, thalamus, and hypothalamus. 

freedom from distractibility the third factor on the 
WISC-III consisting of Arithmetic and Digit Span. 

frequency distribution a method of summarizing data 
or test scores by specifying a small number of usu- 
ally equal-sized class intervals and then tallying how 
many scores fall within each interval. 
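
The tallying step described above can be sketched as follows. This is an illustration under stated assumptions: scores are whole numbers, and the function name, interval width, starting point, and score list are all hypothetical choices made for the example.

```python
from collections import Counter


def frequency_distribution(scores, width, start):
    """Tally integer scores into equal-sized class intervals.

    Returns a list of (lower limit, upper limit, frequency) tuples, where the
    first interval begins at `start` and each interval spans `width` score points.
    """
    counts = Counter((score - start) // width for score in scores)
    table = []
    for index in range(min(counts), max(counts) + 1):
        low = start + index * width
        table.append((low, low + width - 1, counts.get(index, 0)))
    return table


# Hypothetical test scores tallied into intervals of 10 points starting at 70.
for low, high, freq in frequency_distribution([70, 72, 75, 81, 88, 90], 10, 70):
    print(f"{low}-{high}: {freq}")
```

Intervals with no scores are kept in the output with a frequency of zero so that the class intervals remain equal in size and contiguous.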

frequency polygon a method of summarizing data or 
test scores in graphic form; similar to a histogram, 
except that the frequency of the class intervals is rep- 
resented by single points rather than columns. 

frontal lobe the part of the cerebral cortex at the front 
of the brain that is required for the programming, 
regulation, and verification of executive functions 
and motor performance. 

frustration in Rosenzweig’s system, the state that oc- 
curs whenever an organism encounters an obstacle 
or obstruction en route to the satisfaction of a need. 

functionalist definition of validity the view that a test 
is valid if it serves the purpose for which it is used. 

fundamental lexical hypothesis in personality theory, 
the notion that trait terms have survived in language 
because they convey important information about 
our dealings with others. 

general factor according to Spearman, the single gen- 
eral factor of intelligence that must exist to account 
for the observed correlations between a large num- 
ber of tests. 

generalizability theory a domain sampling model of 
reliability that recognizes several alternatives of gen- 
eralization for test results. 

gifted the designation of a person as gifted typically 
means that he or she has extraordinary ability in 
some area. 

grade norm a type of standardization that depicts the 
level of test performance for each separate school 
grade in the normative sample. 

graphic rating scale a scale that consists of trait la- 
bels, brief definitions of those labels, and a contin- 
uum for the rating. 

group achievement tests also called educational 
achievement tests, these instruments are commonly 
administered to dozens or hundreds of students at the
same time to gauge achievement levels in one or
more well-defined academic domains. 

group tests mainly pencil-and-paper measures suitable
to the testing of large groups of persons at the same 
time. 

guilty but mentally ill (GBMI) a verdict allowed in 
some states in which the intention is for the accused 
to begin his or her sentence in a psychiatric hospital. 

halo effect the tendency to rate an employee high or 
low on all dimensions because of a global impression. 

heritability index an estimate of how much of the total 
variance in a given trait is due to genetic factors; the 
index can vary from 0.0 to 1.0. 

hindbrain the lowest, most simply organized, brain 
structures; the hindbrain consists of the myelen- 
cephalon and metencephalon. 

hippocampus part of a complex, ill-defined memory 
circuit that consolidates new experiences into long- 
term memories. 

histogram a method of summarizing data or test scores 
in graphic form; a histogram contains the same in- 
formation as a frequency distribution. 

homogeneous scale a scale in which the individual 
items tend to measure the same thing; homogeneity 
is gauged by item-total correlations. 

hypothalamus a small structure at the center of the 
brain that helps govern motivated behavior and bod- 
ily regulation: feeding, sexual behavior, sleeping, 
temperature regulation, emotional behavior, and 
movement. 

id in psychoanalytic theory, the unconscious part of 
personality that is the seat of all instinctual needs 
such as for food, water, sexual gratification, and 
avoidance of pain. 

illusory validation in projective testing, the finding 
that subjects ignore disconfirming instances and 
cling to their preexisting stereotypes. 

implicit association test a covert measure of attitudes 
that makes use of automatic or “unconscious” asso- 
ciations to target concepts (e.g., racial groups) as de- 
termined by sophisticated reaction time analyses. 

in-basket technique a realistic work sample test that 
simulates the work environment of an administrator. 

index of intellectual deterioration on the Shipley In- 
stitute of Living Scale, an index based on the dis- 
crepancy between verbal and abstractions ability that 
was intended to gauge the effects of organic brain 
impairment. 

individual achievement tests achievement tests administered
one-on-one to gauge achievement levels;
these tests are essential in the assessment of potential
learning disabilities.

individual tests instruments which by their design and 
purpose must be administered one on one. 

informed consent in testing, the principle that test tak- 
ers or their representatives are made aware, in lan- 
guage that they can understand, of the purposes and 
likely consequences of testing. 

insanity plea in court cases, a defense based upon ref- 
erence to legal insanity as spelled out by the Model 
Penal Code or other legal statutes. 

integrative model a model of career assessment in 
which information from interest, ability, and person- 
ality domains is considered simultaneously. 

integrity test an instrument designed to screen poten- 
tial employees for theft-proneness and other undesir- 
able qualities; overt integrity tests contain questions 
about attitudes toward theft and items dealing with 
admission of theft and other illegal activities. 

intelligence according to experts, (1) the capacity to
learn from experience, and (2) the capacity to adapt 
to one’s environment. 

intelligence test although there are exceptions, an intel- 
ligence test generally yields an overall summary score 
based on results from a heterogeneous sample of 
items (e.g., verbal skills, reasoning, spatial thinking). 

interest inventory a test that measures the preference 
for certain activities or topics and thereby helps de- 
termine occupational choice. 

interscorer reliability for tests that involve judgmen- 
tal scoring, the typical degree of agreement between 
scorers. 

interval scale a measurement scale that provides in- 
formation about ranking and the relative strength of 
ranks; based on the assumption of equal-sized units 
or intervals for the underlying scale. 

intrinsic religious expression the use of religion for 
internal goals such as finding meaning and direction 
in life. 

introversion a quiet, “bookish,” reserved personality 
disposition. 

ipsative test a test in which the average of the sub- 
scales is always the same for every examinee; thus, 
for an individual examinee, high scores on subscales 
must be balanced by low scores on other subscales. 

IQ constancy on the Wechsler tests, the axiomatic assumption
that IQ must remain constant with normal
aging, even though raw intellectual ability might 
shift or decline. 

item-characteristic curve a graphical display of the
relationship between the probability of a correct response
and the examinee’s position on the underlying
trait measured by the test.

item-difficulty index for a single test item, the pro- 
portion of examinees in a large tryout sample who 
get that item correct. 

item-discrimination index a statistical index of how 
efficiently an item discriminates between persons 
who obtain high and low scores on the entire test. 
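
The two item indices above can be illustrated with a short sketch. The function names, the default 27 percent upper and lower groups, and the sample data are assumptions for illustration; the text itself does not fix the size of the contrasting groups.

```python
def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (1 = right, 0 = wrong)."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores, fraction=0.27):
    """(U - L) / N, where U and L count correct answers among the upper
    and lower scorers on the whole test and N is the size of each group."""
    n = max(1, round(len(total_scores) * fraction))
    # Rank examinees by total test score, lowest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    u = sum(item_scores[i] for i in upper)
    l = sum(item_scores[i] for i in lower)
    return (u - l) / n

# Hypothetical tryout sample of ten examinees.
item = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
totals = [9, 8, 7, 3, 2, 6, 1, 4, 5, 10]
p = item_difficulty(item)
d = item_discrimination(item, totals)
```

An index near 1.0 suggests that high scorers pass the item and low scorers fail it; an index near zero suggests the item does not separate the two groups.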

item-reliability index $s_i r_{iT}$, the product of a test item’s
internal consistency as indexed by the correlation
with the total score ($r_{iT}$) and its variability as indexed
by the standard deviation ($s_i$).

item response theory also known as latent trait theory, 
a modern framework for test construction in which 
the psychometrician posits a single dimension of 
skill or underlying trait on which all of the test items 
rely; each respondent is assumed to have a certain 
amount of the latent trait being measured. 

item-validity index $s_i r_{iC}$ consists of the product of a test
item’s standard deviation ($s_i$) and the point-biserial
correlation with the criterion ($r_{iC}$).

job analysis the process of defining a job in terms of 
the behaviors necessary to perform it; includes job 
description (physical characteristics of the work) and 
job specification (personal characteristics needed). 

Kuder-Richardson formula 20 an index of reliabil- 
ity that is relevant to the special case where each test 
item is scored 0 or 1 (e.g., right or wrong). 
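
A minimal sketch of the computation, assuming the usual form of the formula, KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), with the population variance in the denominator. The function name `kr20` and the tiny data set are illustrative assumptions.

```python
from statistics import pvariance

def kr20(responses):
    """responses: one row per examinee, each row a list of 0/1 item scores."""
    k = len(responses[0])          # number of items
    n = len(responses)             # number of examinees
    totals = [sum(row) for row in responses]
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # proportion passing item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / pvariance(totals))

# Hypothetical data: four examinees, three dichotomous items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(data)
```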

Lake Wobegon effect the observation that virtually all 
states of the union claim that average achievement 
scores for their school systems exceed the 50th per- 
centile. 

latent trait theory a modern framework for test con- 
struction in which a single dimension of skill or un- 
derlying trait is posited. See item response theory. 

learning disability an indistinct concept that typically 
refers to a severe discrepancy between general ability 
and individual achievement that cannot be explained 
by sensory/motor handicaps, mental retardation, 
emotional problems, or cultural deprivation. 

legally blind this term applies to individuals with cen- 
tral visual acuity of 20/200 or less in the better eye 
(with correction) or to those with significant reduction 
in their visual field to a diameter of 20 degrees or less; 
used to determine eligibility for government benefits. 

Likert scale a scale that presents the examinee with 
five responses ordered on an agree/disagree or ap- 
prove/disapprove continuum. 

limbic lobe a group of subcortical structures responsi- 
ble for elaboration of emotion and the control of vis- 
ceral activity. 





local norms norms derived from a representative local 
sample, as opposed to a national sample. 

locus of control a construct that refers to perceptions 
that people have about the source of things that hap- 
pen to them (e.g., internal versus external). 

longitudinal design a research design in which the 
same subjects are tested at several points in time. 

mean the arithmetic average of a group of scores. 

measurement error everything other than the true score
that makes up an examinee’s obtained test score. 

median the middlemost score when all the scores in a 
sample have been ranked. 

medulla oblongata part of the hindbrain that helps me- 
diate swallowing, vomiting, breathing, the control of 
blood pressure, respiration, and, partially, heart rate. 

mental retardation significantly subaverage general 
intellectual functioning resulting in or associated 
with concurrent impairments in adaptive behavior 
and manifested during the developmental period. 

mental state at the time of the offense (MSO) the 
mental state of a defendant at the time of the offense 
is relevant in special pleadings such as the insanity 
defense; psychologists and psychiatrists may offer 
opinions as to the MSO of defendants. 

method of absolute scaling a procedure for obtaining 
a measure of absolute item difficulty based upon re- 
sults for different age groups of test takers. 

method of empirical keying a scale development 
method in which test items are selected based en- 
tirely on how well they contrast a criterion group 
from a normative sample. 

method of equal-appearing intervals a method for 
constructing interval-level scales from attitude state- 
ments. 

method of rational scaling a scale construction method 
in which all scale items correlate positively with each 
other and also with the total score for the scale; also 
known as the internal consistency approach. 

midbrain the middle portion of the brain consisting 
of cranial nerves and relay stations for vision and 
hearing. 

mixed-standard scale a complex approach to performance
evaluation designed to minimize rating errors
in performance appraisal; items for performance dimensions
are randomly ordered on the scale.

M’Naughten rule one of several standards of legal insanity;
essentially, “the party accused was laboring
under such a defect of reason, from disease of the
mind, as not to know the nature and quality of the act
he was doing. . . .”

mode the most frequently occurring score.

Model Penal Code rule a standard of legal insanity: “A
person is not responsible for criminal conduct if at the 
time of such conduct, as a result of mental disease or 
defect, he lacks substantial capacity either to appreci- 
ate the criminality [wrongfulness] of his conduct or to 
conform his conduct to the requirements of the law.” 

moral dilemma a brief story that involves a difficult 
moral choice such as whether to steal to prolong 
someone’s life; used in the study of moral reasoning. 

multimedia the collective capacity of the modern 
computer to use still images, live video segments, 
music, tables, charts, animation, and other approaches
in interactive format.

multitrait-multimethod matrix a research design for
assessing convergent and discriminant validity that
calls for the assessment of two or more traits by two 
or more methods. 

neuropsychological tests tests and procedures with 
proven sensitivity to the effects of brain damage. 

neuropsychology the study of the relationship be- 
tween brain function and behavior. 

nominal scale a measurement scale in which the cate- 
gories are arbitrary and do not designate “more” or 
“less” of anything; the simplest and lowest level of 
measurement. 

nonverbal behavior the subtler forms of human com- 
munication contained in glance, gesture, body lan- 
guage, tone of voice, and facial expression. 

norm group a sample of examinees who are represen- 
tative of the population for whom the test is intended. 

norm-referenced test a test in which the performance 
of each examinee is interpreted in reference to a rel- 
evant standardization sample. 

normal distribution a symmetrical, mathematically 
defined, bell-shaped frequency distribution. 

normal ogive the normal distribution graphed in cu- 
mulative form. 

normalized standard score a score obtained by a 
transformation that renders a skewed distribution 
into a normal distribution. 

norms a summary of test results for a large and representative
group of subjects.

not guilty by reason of insanity (NGRI) a verdict al- 
lowed in some states in which the defendant is found 
not guilty because his or her criminal act was the re- 
sult of mental disease or defect. 

oblique axes in factor analysis, the assumption that 
factors are correlated with one another, that is, not at 
right angles. 

occipital lobe the part of the cerebral cortex at the rear 
of the brain that contains the vision centers.

occupational reinforcer patterns an evaluation of
jobs in terms of the worker-perceived reinforcers that 
are present or absent. 

operational definition a definition of a concept in 
terms of the way it is measured, such as, intelligence 
is “what the tests test.” 

ordinal scale a measurement scale that allows for 
ranking; ordinal scales do not provide information 
about the relative strength of ranking. 

orthogonal axes in factor analysis, the assumption that 
the factors are at right angles to one another, which 
means that they are uncorrelated. 

overt integrity test an employment test that seeks to 
assess attitudes toward theft; these instruments may 
also contain a section dealing with overt admissions 
of theft. 

paralinguistics the nonverbal aspects of speech such 
as tone of voice and rate of speaking. 

parietal lobe the part of the cerebral cortex that medi- 
ates spatial integration and sensory awareness of 
what is happening on the surface of the body. 

Parkinson’s disease a degenerative brain disease char- 
acterized by three types of motor disturbance: invol- 
untary movement, including tremor; poverty and 
slowness of movement without paralysis; and 
changes in posture and muscle tone. 

percentile the percentage of persons in the standard- 
ization sample who scored below a specific raw 
score; percentiles vary from 0 to 100. 
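
Read literally, the definition amounts to a simple count. The sketch below assumes a hypothetical standardization sample and a hypothetical helper name `percentile`; ties are ignored, since only scores strictly below the raw score are counted.

```python
def percentile(raw_score, sample):
    """Percentage of the standardization sample scoring below raw_score."""
    below = sum(1 for s in sample if s < raw_score)
    return 100 * below / len(sample)

# Hypothetical standardization sample of ten raw scores.
sample = [10, 12, 15, 15, 18, 20, 22, 25, 27, 30]
rank = percentile(20, sample)
```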

perceptual organization the second factor on the 
WISC-III consisting of Picture Arrangement, Picture 
Completion, Block Design, and Object Assembly. 

personal injury in personal injury lawsuits, attorneys 
may hire psychologists to testify as to the lifelong 
consequences of traumatic stress or acquired brain 
damage. 

personality an inexplicit construct which is invoked to 
explain behavioral consistency within persons and 
behavioral distinctiveness between persons. 

personality coefficient a term used to refer to the find- 
ing that the predictive validity of personality scales 
rarely exceeds .30. 

personality test a test that measures the traits, quali- 
ties, or behaviors that determine a person’s individ- 
uality; this information helps predict future behavior. 

pineal body a pea-sized structure that sits at the center 
of the brain; it secretes the hormone melatonin in a 
cyclic biological rhythm, but its functions are not 
well understood. 

placement in testing, the sorting of persons into dif- 
ferent programs appropriate to their needs or skills. 


polygraph a device that monitors ongoing physiolog- 
ical responses, including changes in breathing, pulse 
rate, blood pressure, and perspiration; inaccurately 
referred to as a “lie detector.” 

power test a test that allows enough time for test tak- 
ers to attempt all items; however, the test is difficult 
enough that no test taker is able to obtain a perfect 
score.

predictive validity a type of criterion-related validity 
in which the criterion measures are obtained in the 
future, usually months or years after the test scores 
are obtained, such as when college grades are pre- 
dicted from an entrance exam. 

primary mental abilities the seven group factors of 
intelligence posited by Thurstone. 

processing speed the fourth factor on the WISC-III 
consisting of Coding and Symbol Search. 

projective hypothesis the assumption that personal in- 
terpretations of ambiguous stimuli must necessarily 
reflect the unconscious needs, motives, and conflicts 
of the examinee. 

projective test a test in which the examinee encoun- 
ters vague, ambiguous stimuli and responds with his 
or her own constructions. 

psychometrician a specialist in psychology or educa- 
tion who develops and evaluates psychological tests. 

Public Law 93-112 a “Bill of Rights” for persons
with disabilities that outlawed discrimination based 
upon disability. 

Public Law 94-142 the Education for All Handicapped 
Children Act that mandated that schoolchildren with 
disabilities receive appropriate assessment and edu- 
cational opportunities. 

Public Law 99-457 legislation that requires states to 
provide a free appropriate public education to chil- 
dren ages 3 through 5 who have disabilities. 

pupillometrics the measurement of pupil size to gauge 
interest in, or pleasure in, the observed stimulus. 

Q-technique a technique for studying changes in self- 
concept and other variables by the sorting of state- 
ments into a near-normal distribution for assigned 
categories. 

qualified individualism in testing for selection, the 
ethical stance that age, sex, race, or other demo- 
graphic characteristics must not be used, even if 
knowledge of these factors would improve the va- 
lidity of selection. 

quotas in testing for selection, the ethical stance that 
the best-qualified candidates within definable sub- 
groups should be selected in proportion to their rep- 
resentation in the population. 


random sampling a selection strategy in which every 
subject has an equal chance of being chosen. 

rapport in testing, a comfortable, warm atmosphere that
serves to motivate examinees and elicit cooperation. 

Rasch Model named after the Danish mathematician 
Georg Rasch, this mathematical model uses complex 
equations to predict the probability of respondents at 
different skill levels correctly answering test questions. 

rater bias the tendency for supervisor ratings to be in- 
accurate because of leniency, severity, and other 
forms of evaluation errors. 

ratio scale a measurement scale that yields equal-sized 
units or intervals and that possesses a conceptually 
meaningful zero point; the highest level of measure- 
ment. 

raw score the most basic level of information provided 
by a psychological test, for example, the number of 
questions answered correctly. 

real definition a definition that seeks to tell us the true 
nature of the thing being defined. 

regression equation an equation that describes the 
best-fitting straight line for estimating the criterion 
from the test; the best-fitting line is one that mini- 
mizes the sum of the squared deviations from the line. 
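
As an illustration of the least-squares criterion, the following sketch fits the line for a single predictor. The name `fit_line` and the data are assumptions for illustration; x stands for test scores and y for criterion scores.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of
    squared vertical deviations of y from the line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical test scores (x) and criterion scores (y).
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = intercept + slope * 5   # estimated criterion for a test score of 5
```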

reliability the attribute of consistency in measurement. 

reliability coefficient the ratio of true score variance to 
the total variance of test scores. 

religion as Quest the view that complexity, doubt, and 
tentativeness are aspects of mature religious expression. 

restriction of range a phenomenon in which the range 
on a variable is restricted, causing correlations with 
other variables to be artificially low. 

reticular formation a network of ascending and de- 
scending nerve cell bodies and fibers that governs 
general arousal or consciousness. 

RIASEC model a theory of person-environment types 
that proposes six themes: Realistic, Investigative, 
Artistic, Social, Enterprising, and Conventional 
(RIASEC). 

rotation to positive manifold in factor analysis, a 
method of rotating the factor matrix that seeks to 
eliminate as many of the negative factor loadings as 
possible. 

rotation to simple structure in factor analysis, a 
method of rotating the factor matrix that seeks to 
simplify the factor loadings so that each test has sig- 
nificant loadings on as few factors as possible. 

routing procedure in tests such as the Stanford-Binet: 
Fifth Edition, the first items or subtests administered 
for the purpose of determining the appropriate starting
points for subsequent subtests.

routing test an initial subtest used to determine the
entry level for all remaining subtests; used with in- 
dividual intelligence tests such as the SB:FE. 

savant an individual who has mental deficiencies and 
a highly developed talent in a single area such as art,
rapid calculation, memory, or music. 

schema in Piaget’s theory, an organized pattern of be- 
havior or a well-defined mental structure that leads 
to knowing how to do something. 

screening the use of quick and simple tests or proce- 
dures to identify persons who might have special 
characteristics or needs. 

self-efficacy in Bandura’s theory, the personal judg- 
ment of how well one can execute courses of action 
required to deal with prospective situations. 

self-monitoring a therapeutic approach in which the 
client chooses the goals and actively participates in 
supervising, charting, and recording progress toward 
the end point(s) of therapy. 

semantic differential a rating technique in which the 
subject uses a seven-point continuum to rate a con- 
cept on a number of bipolar adjectives such as good- 
bad, strong-weak, active-passive. 

simultaneous processing a form of information pro- 
cessing characterized by the simultaneous execution 
of several different mental operations. 

situational exercise an assessment procedure in which 
the prospective employee is asked to perform under 
circumstances that are highly similar to the anticipated
work environment.

skewness the symmetry or asymmetry of a frequency 
distribution; positive skew indicates that scores are 
piled up at the low end and negative skew indicates 
that scores are piled up at the high end. 
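
One common way to quantify this, offered here only as a sketch, is the third standardized moment, which is positive when scores pile up at the low end and negative when they pile up at the high end. The function name and the sample data are illustrative assumptions; the text does not prescribe a particular index.

```python
def skewness(scores):
    """Third standardized moment, using the population standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return sum((s - mean) ** 3 for s in scores) / (n * sd ** 3)

low_pile = skewness([1, 1, 2, 2, 3, 9])    # scores piled up at the low end
high_pile = skewness([1, 7, 8, 8, 9, 9])   # scores piled up at the high end
```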

social desirability response set the tendency of ex- 
aminees to react to the perceived desirability (or un- 
desirability) of a test item rather than responding 
accurately to its content. 

social intelligence the capacity to understand other 
people and to relate effectively to them. 

source traits the stable and constant sources of behav- 
ior that are less visible than surface traits but more 
important in accounting for behavior. 

Spearman-Brown formula a formula for adjusting 
split-half correlations so that they reflect the full 
length of a scale. 

specific factor according to Spearman, a factor of in- 
telligence specific to an individual test. 

speed test a timed test that contains items of uniform 
and generally simple level of difficulty; the time limit 
is strict enough that few subjects finish a speed test.

split-half reliability a form of reliability in which
scores from the two halves of a test (e.g., even items 
versus odd items) are correlated with one another; 
the correlation is then adjusted for test length. 
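
A minimal sketch of the odd-even procedure, with the length adjustment made by the Spearman-Brown formula in its usual form for doubling test length, r_full = 2r / (1 + r). The function names and the small data set are assumptions for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_half):
    """Adjust a half-test correlation to reflect the full test length."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(responses):
    """responses: one row of item scores per examinee."""
    odd = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))

# Hypothetical data: four examinees, four items.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
reliability = split_half_reliability(data)
```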

standard deviation a statistical index that reflects the 
degree of dispersion in a group of scores; the square 
root of the variance. 

standard error of measurement an index of mea- 
surement error which indicates the extent to which 
an examinee’s score might vary over a number of 
parallel tests. 
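
The index is conventionally computed as SEM = SD * sqrt(1 - r), where SD is the standard deviation of test scores and r is the reliability coefficient. The sketch below assumes that formula; the function name and the numbers are illustrative only.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# Hypothetical test with SD = 15 and reliability = .91.
sem = standard_error_of_measurement(15, 0.91)
```

With these assumed values the SEM is about 4.5 points, so an obtained score would be expected to fall within roughly 4.5 points of the true score about two-thirds of the time.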

standard error of the difference a statistical index 
that can help a test user determine whether, for an in- 
dividual examinee, the difference between scores on 
two tests or subtests is significant. 

standard error of estimate $SE_{est}$ is the margin of error
to be expected in the predicted criterion score; the
error of estimate is derived from the following
formula:

$SE_{est} = SD_y \sqrt{1 - r_{xy}^2}$
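
As a worked illustration of this formula, taking SD_y as the standard deviation of criterion scores and r_xy as the validity coefficient, a short sketch follows. The function name and the numeric values are assumptions for illustration.

```python
from math import sqrt

def standard_error_of_estimate(sd_y, r_xy):
    """SE_est = SD_y * sqrt(1 - r_xy ** 2)."""
    return sd_y * sqrt(1 - r_xy ** 2)

# Hypothetical criterion SD of 10 and validity coefficient of .60.
se_est = standard_error_of_estimate(10, 0.6)
```

With these assumed values the margin of error is about 8 points, which shows how much uncertainty remains in predicted criterion scores even with a moderately high validity coefficient.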


standard of care the standard of care that is usual, cus- 
tomary, or reasonable. 

standard score a transformed score in which the orig- 
inal score is expressed as the distance from the mean 
in standard deviation units.

standardization fallacy the fallacious view that a test 
standardized on one population is ipso facto unfair 
when used in any other population. 

standardization sample a large and representative 
group of subjects representative of the population for 
whom the test is intended. 

standardized procedure in testing, the attempt 
through carefully written instructions to ensure that 
the procedures for administering a test are uniform 
from one examiner and setting to another. 

stanine scale a scale in which all raw scores are con- 
verted to a single-digit system of scores ranging from 
1 to 9. 

state anxiety the transitory feelings of fear or worry 
that most persons experience on occasion. 

sten scale a 10-unit scale with five units above and five
units below the mean.

stereotype threat the threat of confirming, as self-char- 
acteristic, a negative stereotype about one’s group. 

stratified random sampling a selection strategy in 
which subjects are chosen randomly, with the con- 
straint that the sample matches the population on rel- 
evant background variables such as race, sex, 
occupation, and so on. 


subgroup norms norms derived from an identified 
subgroup, as opposed to a diversified national 
sample. 

successive processing a form of information process- 
ing in which a proper sequence of mental operations 
must be followed. 

superego in psychoanalytic theory, that part of per- 
sonality that is roughly synonymous with conscience 
and comprises the societal standards of right and 
wrong that are conveyed to us by our parents. 

surface traits in Cattell’s theory, the more obvious as- 
pects of personality that typically emerge in the first 
stages of factor analysis when individual test items 
are correlated with each other. 

systematic measurement error a type of measure- 
ment error that arises when, unknown to the test de- 
veloper, a test consistently measures something other 
than the trait for which it was intended. 

table of specifications in test development, a table that 
lists the exact number of items in relevant content 
areas; such a table also specifies the precise number 
of items which must embody different cognitive 
processes. 

technical manual in testing, the manual that summa- 
rizes the technical data about a new instrument. 

temporal lobe the part of the cerebral cortex involved 
in processing of auditory sensations, long-term 
memory storage, and modulation of biological drives 
such as aggression, fear, and sexuality. 

teratogen a substance that crosses the placental barrier 
and causes physical deformities in the fetus. 

test a standardized procedure for sampling behavior 
and describing it with categories or scores. In addi- 
tion, most tests have norms or standards by which 
the results can be used to predict other, more impor- 
tant, behaviors. 

test anxiety a constellation of phenomenological, 
physiological, and behavioral responses that accom- 
pany concern about possible failure on a test. 

test bias in popular usage, a test is biased if it discrim- 
inates unfairly against racial and ethnic minorities, 
women, and the poor; technically, test bias refers to 
differential validity for definable, relevant subgroups 
of persons. 

test fairness the extent to which the social conse- 
quences of test usage are considered fair to relevant 
subgroups; a matter of social values, test fairness is 
especially pertinent when tests are used for selection 
decisions. 

test-retest reliability a form of reliability in which the 
same test is given twice to the same group of heterogeneous
and representative subjects; scores for
the two sessions are then correlated. 

thalamus a key structure that provides sensory input 
and information about ongoing movement to the 
cerebral cortex; the thalamus is the major relay sta- 
tion in the brain. 

token economy a behavioral approach in which many 
different forms of prosocial behavior are rewarded 
with tokens that can be later exchanged for material 
rewards or privileges. 

trait any relatively enduring way in which one individual
differs from another.

trait anxiety the relatively stable tendency of an indi- 
vidual to respond anxiously to a stressful predicament. 

true score an examinee’s hypothetical real score on a 
test; the true score can be estimated probabilistically, 
but is never directly known. 

T score a transformed score with mean of 50 and stan- 
dard deviation of 10. 
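
A brief sketch of the transformation: the raw score is first expressed as a standard score (distance from the mean in standard deviation units) and then rescaled to a mean of 50 and a standard deviation of 10. The function names and the sample values are illustrative assumptions.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Standard score rescaled to mean 50 and standard deviation 10."""
    return 50 + 10 * z_score(raw, mean, sd)

# Hypothetical raw score of 115 on a test with mean 100 and SD 15.
t = t_score(115, 100, 15)
```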

Type A coronary-prone behavior pattern a behavior 
pattern consisting of insecurity of status, hyperag- 
gressiveness, free-floating hostility, and a sense of 
time urgency (hurry sickness). 

unqualified individualism in testing for selection, 
the ethical stance that, without exception, the best- 
qualified candidates should be selected for employ- 
ment, admission, or other privilege. 

user’s manual in testing, the manual that gives in- 
structions for administration and also provides 
guidelines for test interpretation.

validity a test is valid to the extent that inferences made
from it are appropriate, meaningful, and useful. 

validity coefficient the correlation between test and 
criterion ($r_{xy}$).

validity shrinkage the common discovery in cross- 
validation research that a test predicts the relevant 
criterion less accurately with the new sample of ex- 
aminees than with the original tryout sample. 

value according to Rokeach and others, a shared and
enduring belief about ideal modes of behavior or end 
states of existence. 

variance a statistical index that reflects the degree of 
dispersion in a group of scores. 

ventricles fluid-filled caverns within the brain. 

verbal comprehension the first factor on the WISC- 
III consisting of Information, Similarities, Vocabu- 
lary, and Comprehension. 

virtual reality the use of sophisticated computer im- 
ages projected to wrap-around goggles to portray a 
moving, changing, three-dimensional environment. 

visual agnosia a difficulty in the recognition of draw- 
ings, objects, or faces caused by brain damage. 

work sample an assessment procedure that uses a 
miniature replica of the job for which examinees 
have applied. 

work values the needs, motives, and values that influ- 
ence vocational choice, job satisfaction, and career 
development.

quotas and, 250
unqualified individualism and, 249 
Test of Everyday Attention, 331-332 
Test of Nonverbal Intelligence-3, 226-227 
Test of Temporal Orientation, 349 


Tests of Achievement and Proficiency (TAP), 296 
Tests of General Educational Development (GED), 296 


Thematic Apperception Test (TAT), 25, 510-512 
Theory of multiple intelligences, 152-153 
Thompson-TAT, 514 

Thurstone Personality Schedule, 24
Tilting-room-tilting-chair tests, 39 


Time Urgency and Perpetual Activation (TUPA) Scale, 525 


Tinkertoy Test, 343 
Token economy, 548 


Token Test, 339 

Trait, 495 

Traumatic brain injury, 319 

Triarchic theory of intelligence, 154-156 

True score, 32, 78 

T score, 66-67 

Type A coronary-prone behavior pattern, 491-492, 524-526 


Uniform Guidelines on Employee Selection, 436-437 
United States v. Georgia Power, 434 

Unqualified individualism (test fairness), 249 

User’s manual, 136 


Validity, 97-115 
concurrent, 100, 102 
construct, 108-114 
content, 98-100 
criterion-related, 100-103 
face, 100 
predictive, 102 
widening scope of, 114-115 
Validity coefficient, 103 
Validity shrinkage, 133 
Value, 442 
Values Scale, 462 
Variance, 61 
Vineland Adaptive Behavior Scales (VABS), 236-237 
Violence, prediction of, 388-390 
Virtual reality, 569 
Visual agnosia, 315 
Visual interaction, 553 
Vocabulary (Wechsler subtest), 183 
Vocational Preference Inventory (VPI), 451-453 


Washington University Sentence Completion Test, 508 

Watson v. Fort Worth Bank and Trust, 437-438 

Wechsler Adult Intelligence Scale-III (WAIS-III), 189-192

Wechsler-Bellevue Intelligence Scales, 62 

Wechsler Individual Achievement Test-II (WIAT-II), 365

Wechsler Intelligence Scale for Children-III (WISC-III), 192-194

Wechsler Memory Scale-III, 335-336 

Wechsler Preschool and Primary Scale of Intelligence- 
Revised (WPPSI-R), 164-165 

Wernicke’s area, 321 

Western Aphasia Battery, 339 

Wide Range Achievement Test-III, 365 

Wisconsin Card Sorting Test, 343 

Wonderlic Personnel Test, 410-411 

Woodcock-Johnson Psycho-Educational Battery (WJ-R), 365 

Work sample, 419-422 

Work values, 459 

Work Values Inventory, 461-462 


